Posted to user@pig.apache.org by Rajgopal Vaithiyanathan <ra...@gmail.com> on 2012/04/19 04:02:11 UTC

OutputSchema for EvalFunc

Hey all,

Sorry if I sound naive, but how should one implement outputSchema for an
EvalFunc that returns a tuple?
The way I do it is:

public Schema outputSchema(Schema input) {
    List<FieldSchema> list = new ArrayList<FieldSchema>();
    list.add(new FieldSchema("one", DataType.CHARARRAY));
    list.add(new FieldSchema("two", DataType.CHARARRAY));

    return new Schema(list);
}

But in the front end, if I use
  B = foreach A generate FUNC();
  describe B
I get a schema like this:
    { ( one:chararray, two:chararray ) }
Now I use a flatten on this, like:
    B = foreach A generate flatten(FUNC());
and I get { null::one : chararray, null::two : chararray }

The question is:
How should I implement outputSchema so that I get a schema like
{ one : chararray, two : chararray }  // NOTE: without the parentheses

Re: OutputSchema for EvalFunc

Posted by Rajgopal Vaithiyanathan <ra...@gmail.com>.
Awesome :) Thanks a lot..

On Thu, Apr 19, 2012 at 10:26 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> Haha, when I say naive I don't mean bad... plenty of my scripts use that
> approach, and often it's unavoidable, so it's good to understand.
>
> As far as the naming issue goes, when you flatten it is usually a good idea
> to give the resulting columns names. So your example would become:
>
>
> A = load 'one.txt' as (a:int, b:int);
> B = load 'two.txt' as (a:int, b:int);
> A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a,b);
> B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (x,y);
> C = join A_1 by a full, B_1 by x;
> describe C
>
> That will get rid of the org.apache.pig.builtin.totuple_b prefixes. But
> let's say you still want them to have the same names; you can do that too:
>
>
> A = load 'one.txt' as (a:int, b:int);
> B = load 'two.txt' as (a:int, b:int);
> A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a,b);
> B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (a,b);
> C = join A_1 by a full, B_1 by a;
> describe C
>
> And in the join result, you can disambiguate A_1::a and B_1::a, and so on.



-- 
Thanks and Regards,
Rajgopal Vaithiyanathan.

Re: OutputSchema for EvalFunc

Posted by Jonathan Coveney <jc...@gmail.com>.
Haha, when I say naive I don't mean bad... plenty of my scripts use that
approach, and often it's unavoidable, so it's good to understand.

As far as the naming issue goes, when you flatten it is usually a good idea
to give the resulting columns names. So your example would become:


A = load 'one.txt' as (a:int, b:int);
B = load 'two.txt' as (a:int, b:int);
A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a,b);
B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (x,y);
C = join A_1 by a full, B_1 by x;
describe C

That will get rid of the org.apache.pig.builtin.totuple_b prefixes. But
let's say you still want them to have the same names; you can do that too:


A = load 'one.txt' as (a:int, b:int);
B = load 'two.txt' as (a:int, b:int);
A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a,b);
B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (a,b);
C = join A_1 by a full, B_1 by a;
describe C

And in the join result, you can disambiguate A_1::a and B_1::a, and so on.

2012/4/19 Rajgopal Vaithiyanathan <ra...@gmail.com>

> I knew it would sound naive :P I didn't even know a schema parser existed!
>
> `it can only return a tuple, which you then flatten into columns.`
>
>
> Isn't this bad? For example, see this (for simplicity I'm using
> TOTUPLE instead of my UDF):
>
> A = load 'one.txt' as (a:int, b:int);
> B = load 'two.txt' as (a:int, b:int);
> A_1 = foreach A generate flatten(TOTUPLE(a,b));
> B_1 = foreach B generate flatten(TOTUPLE(a,b));
> C = join A_1 by a full, B_1 by a;
> describe C
>
> The schema description looks like this:
>
> C: {A_1::org.apache.pig.builtin.totuple_b_18::a: int,
>     A_1::org.apache.pig.builtin.totuple_b_18::b: int,
>     B_1::org.apache.pig.builtin.totuple_b_19::a: int,
>     B_1::org.apache.pig.builtin.totuple_b_19::b: int}
>
> and the totuple_b_** part of the description obviously changes every time
> I describe, because it is based on a counter.
> Now how do I disambiguate between A_1's a, b and B_1's a, b?
>
>
> Raj :)
>

Re: OutputSchema for EvalFunc

Posted by Rajgopal Vaithiyanathan <ra...@gmail.com>.
I knew it would sound naive :P I didn't even know a schema parser existed!

`it can only return a tuple, which you then flatten into columns.`


Isn't this bad? For example, see this (for simplicity I'm using
TOTUPLE instead of my UDF):

A = load 'one.txt' as (a:int, b:int);
B = load 'two.txt' as (a:int, b:int);
A_1 = foreach A generate flatten(TOTUPLE(a,b));
B_1 = foreach B generate flatten(TOTUPLE(a,b));
C = join A_1 by a full, B_1 by a;
describe C

The schema description looks like this:

C: {A_1::org.apache.pig.builtin.totuple_b_18::a: int,
    A_1::org.apache.pig.builtin.totuple_b_18::b: int,
    B_1::org.apache.pig.builtin.totuple_b_19::a: int,
    B_1::org.apache.pig.builtin.totuple_b_19::b: int}

and the totuple_b_** part of the description obviously changes every time
I describe, because it is based on a counter.
Now how do I disambiguate between A_1's a, b and B_1's a, b?


On Thu, Apr 19, 2012 at 12:07 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> Dmitriy's suggestion is spot on, but just to be pedantic, you'd do:
>
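> public Schema outputSchema(Schema input) {
>    List<FieldSchema> list = new ArrayList<FieldSchema>();
>    list.add(new FieldSchema("one", DataType.CHARARRAY));
>    list.add(new FieldSchema("two", DataType.CHARARRAY));
>
>    try {
>       // Wrap both fields in a tuple named "t".
>       return new Schema(new Schema.FieldSchema("t", new Schema(list),
>             DataType.TUPLE));
>    } catch (Exception e) {
>       // The tuple FieldSchema constructor can throw FrontendException,
>       // which outputSchema does not declare.
>       throw new RuntimeException(e);
>    }
> }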
>
> That said, in your question you asked: "how can you get it without the
> parenthesis." Short answer is that you can't. A UDF can't return multiple
> columns -- it can only return a tuple, which you then flatten into columns.


Raj :)

Re: OutputSchema for EvalFunc

Posted by Jonathan Coveney <jc...@gmail.com>.
Dmitriy's suggestion is spot on, but just to be pedantic, you'd do:

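public Schema outputSchema(Schema input) {
   List<FieldSchema> list = new ArrayList<FieldSchema>();
   list.add(new FieldSchema("one", DataType.CHARARRAY));
   list.add(new FieldSchema("two", DataType.CHARARRAY));

   try {
      // Wrap both fields in a tuple named "t".
      return new Schema(new Schema.FieldSchema("t", new Schema(list),
            DataType.TUPLE));
   } catch (Exception e) {
      // The tuple FieldSchema constructor can throw FrontendException,
      // which outputSchema does not declare.
      throw new RuntimeException(e);
   }
}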

That said, in your question you asked: "how can you get it without the
parenthesis." Short answer is that you can't. A UDF can't return multiple
columns -- it can only return a tuple, which you then flatten into columns.
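
To make the naming concrete, here is a toy model in plain Java of how the
flattened field names seem to be built. It is only a sketch and uses no Pig
classes: FlattenNaming and flatten are hypothetical names, and the rule it
encodes (each inner field is reported as alias::field, with a missing tuple
alias printed as null) is an assumption inferred from the null::one output in
the original question, not Pig's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: mimics the field names reported after FLATTEN of a tuple.
// Hypothetical helper, not part of Pig.
class FlattenNaming {

    // Assumed rule: each inner field is reported as "<tupleAlias>::<field>".
    // A tuple without an alias (null) ends up with the literal prefix "null".
    static List<String> flatten(String tupleAlias, List<String> innerFields) {
        List<String> names = new ArrayList<String>();
        for (String field : innerFields) {
            names.add(tupleAlias + "::" + field);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("one", "two");
        // No tuple alias, as with "return new Schema(list)".
        System.out.println(flatten(null, fields)); // prints [null::one, null::two]
        // Tuple alias "t", as with the FieldSchema("t", ..., DataType.TUPLE) wrapper.
        System.out.println(flatten("t", fields));  // prints [t::one, t::two]
    }
}
```

If that assumption holds, naming the tuple "t" only changes the prefix; the
parentheses still disappear only when you flatten in the script.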

2012/4/18 Dmitriy Ryaboy <dv...@gmail.com>

> It's messy. Easier to use the schema parser:
>
> org.apache.pig.impl.util.Utils.getSchemaFromString("t:tuple(len:int,word:chararray)");
>
> Even easier to use the @OutputSchema annotation (coming in 0.11 I believe)
>
> -D

Re: OutputSchema for EvalFunc

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
It's messy. Easier to use the schema parser:
org.apache.pig.impl.util.Utils.getSchemaFromString("t:tuple(len:int,word:chararray)");

Even easier to use the @OutputSchema annotation (coming in 0.11 I believe)

-D


On Wed, Apr 18, 2012 at 7:02 PM, Rajgopal Vaithiyanathan
<ra...@gmail.com> wrote:
> Hey all,
>
> Sorry if I sound naive, but how should one implement outputSchema for an
> EvalFunc that returns a tuple?
> The way I do it is:
>
> public Schema outputSchema(Schema input) {
>    List<FieldSchema> list = new ArrayList<FieldSchema>();
>    list.add(new FieldSchema("one", DataType.CHARARRAY));
>    list.add(new FieldSchema("two", DataType.CHARARRAY));
>
>    return new Schema(list);
> }
>
> But in the front end, if I use
>  B = foreach A generate FUNC();
>  describe B
> I get a schema like this:
>    { ( one:chararray, two:chararray ) }
> Now I use a flatten on this, like:
>    B = foreach A generate flatten(FUNC());
> and I get { null::one : chararray, null::two : chararray }
>
> The question is:
> How should I implement outputSchema so that I get a schema like
> { one : chararray, two : chararray }  // NOTE: without the parentheses