Posted to user@pig.apache.org by Andrew Clegg <an...@gmail.com> on 2011/10/03 17:27:37 UTC

outputSchema for UDF EvalFunc returning a DataBag

Hi,

When you have a UDF that returns a bag, and you're writing the
outputSchema method, do you have to explicitly include the mandatory
'container' tuple within the bag, or is this implicit?

i.e. if I'm returning a bag of ints, can I just do:

return new Schema(
  new FieldSchema(null,
    new Schema(
      new FieldSchema(null, DataType.INTEGER)), DataType.BAG));

Or do I have to explicitly define a tuple like so:

return new Schema(
  new FieldSchema(null,
    new Schema(
      new FieldSchema(null,
        new Schema(
          new FieldSchema(null, DataType.INTEGER)), DataType.TUPLE)),
    DataType.BAG));

The docs seem pretty vague on this, and you're allowed to do either.
My feeling would be that if the first form was illegal, you wouldn't
be allowed to create a schema like that, but this may be wishful
thinking.

Thanks,

Andrew.

-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
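
To make the two shapes in the question easier to compare, here is a minimal, self-contained sketch. It deliberately does not use Pig: BagSchemaShapes, Field, Type and ensureTupleInBag are hypothetical stand-ins invented for this note, and the rendering only loosely imitates describe output. It shows the first form (bag holding the int directly), the second form (bag holding a tuple holding the int), and one possible way to normalise the first into the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of nested schema fields. These classes are stand-ins written for
// this note; they are NOT Pig's Schema / FieldSchema.
final class BagSchemaShapes {

    enum Type {
        INTEGER("int"), CHARARRAY("chararray"), TUPLE("tuple"), BAG("bag");

        final String label;

        Type(String label) {
            this.label = label;
        }
    }

    static final class Field {
        final String alias;       // may be null, as in the question
        final Type type;
        final List<Field> inner;  // empty for scalar types

        Field(String alias, Type type, Field... inner) {
            this.alias = alias;
            this.type = type;
            this.inner = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(inner)));
        }
    }

    // First form: the bag holds the int field directly.
    static Field bareBagOfInts() {
        return new Field(null, Type.BAG, new Field(null, Type.INTEGER));
    }

    // Second form: the bag holds one tuple, which holds the int field.
    static Field wrappedBagOfInts() {
        return new Field(null, Type.BAG,
                new Field(null, Type.TUPLE, new Field(null, Type.INTEGER)));
    }

    // Turn the first form into the second; return the second form unchanged.
    // Assumption: a bag whose only child is a tuple is already wrapped.
    static Field ensureTupleInBag(Field bag) {
        if (bag.type != Type.BAG) {
            throw new IllegalArgumentException("not a bag");
        }
        boolean alreadyWrapped =
                bag.inner.size() == 1 && bag.inner.get(0).type == Type.TUPLE;
        if (alreadyWrapped) {
            return bag;
        }
        Field tuple = new Field(null, Type.TUPLE, bag.inner.toArray(new Field[0]));
        return new Field(bag.alias, Type.BAG, tuple);
    }

    // Render a field: {...} for bags, (...) for tuples, a type name otherwise.
    static String describe(Field f) {
        String prefix = (f.alias == null) ? "" : f.alias + ": ";
        if (f.type == Type.BAG) {
            return prefix + "{" + join(f.inner) + "}";
        }
        if (f.type == Type.TUPLE) {
            return prefix + "(" + join(f.inner) + ")";
        }
        return prefix + f.type.label;
    }

    private static String join(List<Field> fields) {
        StringBuilder sb = new StringBuilder();
        for (Field f : fields) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(describe(f));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(bareBagOfInts()));                    // {int}
        System.out.println(describe(wrappedBagOfInts()));                 // {(int)}
        System.out.println(describe(ensureTupleInBag(bareBagOfInts())));  // {(int)}
    }
}
```

The "already wrapped" test is itself ambiguous for a bag whose single element really is a tuple-typed field, which is the same ambiguity the question is about; treat it as an illustration rather than a rule.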

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Raghu's being a little modest: said understanding is based on getting
ElephantBird to work with arbitrarily nested structures for both versions of
Pig. Chances are he's right :-).

D

On Mon, Oct 3, 2011 at 2:56 PM, Raghu Angadi <an...@gmail.com> wrote:

> My understanding is that Pig 0.8 expects the first form and Pig 0.9
> requires the second.
>
> Raghu.

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Raghu Angadi <an...@gmail.com>.

After multiple attempts, this worked:

grunt> x = load 'x' as (B: {t: (f1:chararray, f2:int)});
grunt> describe x;
x: {B: {t: (f1: chararray,f2: int)}}
grunt> y = foreach x generate FLATTEN(B);
grunt> describe y;
y: {B::f1: chararray,B::f2: int}
grunt>


On Tue, Oct 4, 2011 at 6:01 AM, Andrew Clegg <an...@gmail.com> wrote:

> Yep, getSchemaFromString is what I was looking for, but I can't get it
> to generate a schema (for unit test purposes) that matches what I get
> inside my script during a real run.
>
> As an example, say I have a file like this:
>
> foo\t2
> bar\t3
> baz\t3
> marge\t4
> homer\t4
>
> and I load it like this:
>
> infile = load 'test.txt' as (name:chararray, weight:int);
> grouped = group infile all;
> bucketed = foreach grouped generate flatten(Buckets(infile));
>
> the outputSchema method of my UDF (Buckets) gets called with a schema
> that stringifies like so:
>
> {infile: {name: chararray,weight: int}}
>
> i.e. it has a single field, which is a bag, containing two elements
> directly (no wrapping tuple, presumably because this is Pig 0.8.1?).
>
> (sidenote, I guess the outermost {}s are a display convention, as
> there's only one bag there)
>
> When I'm unit-testing the UDF's outputSchema method, I'd like to
> generate exactly that schema.
>
> But if I call getSchemaFromString like this:
>
> Utils.getSchemaFromString("B: {f1: chararray, f2: int}")
>
> It throws a parser error:
>
> Encountered " "{" "{ "" at line 1, column 4.
> Was expecting one of:
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>
> Two questions I guess.
>
> (1) Is there a way of generating a schema like that via Utils?
>
> (2) ... or is this schema actually wrong, and I'm looking at a symptom
> of https://issues.apache.org/jira/browse/PIG-767 that would behave
> differently if I was in Pig 0.9.0?
>
> Many thanks,
>
> Andrew.

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

This seems to work:

    Utils.getSchemaFromString("(b:bag{f1: chararray, f2: int})");

On Tue, Oct 4, 2011 at 6:01 AM, Andrew Clegg <an...@gmail.com> wrote:

> Yep, getSchemaFromString is what I was looking for, but I can't get it
> to generate a schema (for unit test purposes) that matches what I get
> inside my script during a real run.
>
> As an example, say I have a file like this:
>
> foo\t2
> bar\t3
> baz\t3
> marge\t4
> homer\t4
>
> and I load it like this:
>
> infile = load 'test.txt' as (name:chararray, weight:int);
> grouped = group infile all;
> bucketed = foreach grouped generate flatten(Buckets(infile));
>
> the outputSchema method of my UDF (Buckets) gets called with a schema
> that stringifies like so:
>
> {infile: {name: chararray,weight: int}}
>
> i.e. it has a single field, which is a bag, containing two elements
> directly (no wrapping tuple, presumably because this is Pig 0.8.1?).
>
> (sidenote, I guess the outermost {}s are a display convention, as
> there's only one bag there)
>
> When I'm unit-testing the UDF's outputSchema method, I'd like to
> generate exactly that schema.
>
> But if I call getSchemaFromString like this:
>
> Utils.getSchemaFromString("B: {f1: chararray, f2: int}")
>
> It throws a parser error:
>
> Encountered " "{" "{ "" at line 1, column 4.
> Was expecting one of:
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>    "int" ...
>    "long" ...
>    "float" ...
>    "double" ...
>    "chararray" ...
>    "bytearray" ...
>
> Two questions I guess.
>
> (1) Is there a way of generating a schema like that via Utils?
>
> (2) ... or is this schema actually wrong, and I'm looking at a symptom
> of https://issues.apache.org/jira/browse/PIG-767 that would behave
> differently if I was in Pig 0.9.0?
>
> Many thanks,
>
> Andrew.

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Andrew Clegg <an...@gmail.com>.

Yep, getSchemaFromString is what I was looking for, but I can't get it
to generate a schema (for unit test purposes) that matches what I get
inside my script during a real run.

As an example, say I have a file like this:

foo\t2
bar\t3
baz\t3
marge\t4
homer\t4

and I load it like this:

infile = load 'test.txt' as (name:chararray, weight:int);
grouped = group infile all;
bucketed = foreach grouped generate flatten(Buckets(infile));

the outputSchema method of my UDF (Buckets) gets called with a schema
that stringifies like so:

{infile: {name: chararray,weight: int}}

i.e. it has a single field, which is a bag, containing two elements
directly (no wrapping tuple, presumably because this is Pig 0.8.1?).

(sidenote, I guess the outermost {}s are a display convention, as
there's only one bag there)

When I'm unit-testing the UDF's outputSchema method, I'd like to
generate exactly that schema.

But if I call getSchemaFromString like this:

Utils.getSchemaFromString("B: {f1: chararray, f2: int}")

It throws a parser error:

Encountered " "{" "{ "" at line 1, column 4.
Was expecting one of:
    "int" ...
    "long" ...
    "float" ...
    "double" ...
    "chararray" ...
    "bytearray" ...
    "int" ...
    "long" ...
    "float" ...
    "double" ...
    "chararray" ...
    "bytearray" ...

Two questions I guess.

(1) Is there a way of generating a schema like that via Utils?

(2) ... or is this schema actually wrong, and I'm looking at a symptom
of https://issues.apache.org/jira/browse/PIG-767 that would behave
differently if I was in Pig 0.9.0?

Many thanks,

Andrew.
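
One way to make the difference visible in a unit test is to look at the stringified bag and check whether its contents sit inside a wrapping tuple, as in {t: (f1: chararray,f2: int)}, or directly inside the bag, as in {name: chararray,weight: int}. The following is a minimal sketch under that assumption. BagTupleCheck and bagWrapsTuple are hypothetical names, and this is plain string inspection of describe-style text, not a Pig API, so it is only a heuristic.

```java
// Heuristic check on describe-style text; plain string inspection, not a Pig API.
final class BagTupleCheck {

    // True if the bag's contents are exactly one parenthesised tuple,
    // optionally preceded by a simple alias such as "t:".
    static boolean bagWrapsTuple(String bagDescription) {
        String s = bagDescription.trim();
        if (!s.startsWith("{") || !s.endsWith("}")) {
            throw new IllegalArgumentException("expected {...}, got: " + bagDescription);
        }
        String inner = s.substring(1, s.length() - 1).trim();
        int open = inner.indexOf('(');
        if (open < 0 || !inner.endsWith(")")) {
            return false;
        }
        // Anything before the first '(' must be empty or a single "alias:".
        String prefix = inner.substring(0, open).trim();
        if (!prefix.isEmpty() && !prefix.matches("[A-Za-z_][A-Za-z0-9_]*\\s*:")) {
            return false;
        }
        // The first '(' must be closed by the very last character.
        int depth = 0;
        for (int i = open; i < inner.length(); i++) {
            char c = inner.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i == inner.length() - 1;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(bagWrapsTuple("{name: chararray,weight: int}"));  // false
        System.out.println(bagWrapsTuple("{t: (f1: chararray,f2: int)}"));   // true
    }
}
```

The argument is the bag's own braces only, i.e. the inner {...} of a stringified schema, not the outermost display braces.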


On 4 October 2011 00:14, Raghu Angadi <ra...@apache.org> wrote:
> Utils.getSchemaFromString() seems like exactly what you want
> (from org.apache.pig.impl.util).
>
> Raghu.
>
> [btw. my two previous attempts to send to the list got rejected as spam ]



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Raghu Angadi <ra...@apache.org>.

Utils.getSchemaFromString() seems like exactly what you want
(from org.apache.pig.impl.util).

Raghu.

[BTW, my two previous attempts to send to the list got rejected as spam.]

On Mon, Oct 3, 2011 at 3:41 PM, Andrew Clegg <an...@gmail.com> wrote:

> Thanks Raghu (and Dmitriy).
>
> Could this maybe get added to the docs page on UDFs? (Apologies if
> it's there already and I missed it.)
>
> Also -- it's a bit cumbersome writing all these nested Schema and
> FieldSchema constructors, especially when you're writing tests for
> UDFs with flexible schema support.
>
> I was wondering if it would be practical to reuse whatever code the
> front-end uses to parse schema descriptions from load statements in
> scripts. Is this a silly idea? If it isn't silly, does anyone know
> where I need to look for that code?

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Andrew Clegg <an...@gmail.com>.

Thanks Raghu (and Dmitriy).

Could this maybe get added to the docs page on UDFs? (Apologies if
it's there already and I missed it.)

Also -- it's a bit cumbersome writing all these nested Schema and
FieldSchema constructors, especially when you're writing tests for
UDFs with flexible schema support.

I was wondering if it would be practical to reuse whatever code the
front-end uses to parse schema descriptions from load statements in
scripts. Is this a silly idea? If it isn't silly, does anyone know
where I need to look for that code?


On 3 October 2011 22:56, Raghu Angadi <an...@gmail.com> wrote:
> My understanding is that Pig 0.8 expects the first form and Pig 0.9 requires
> the second.
>
> Raghu.



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

Re: outputSchema for UDF EvalFunc returning a DataBag

Posted by Raghu Angadi <an...@gmail.com>.

My understanding is that Pig 0.8 expects the first form and Pig 0.9 requires
the second.

Raghu.

On Mon, Oct 3, 2011 at 8:27 AM, Andrew Clegg <an...@gmail.com> wrote:

> Hi,
>
> When you have a UDF that returns a bag, and you're writing the
> outputSchema method, do you have to explicitly include the mandatory
> 'container' tuple within the bag, or is this implicit?
>
> i.e. if I'm returning a bag of ints, can I just do:
>
> return new Schema(
>  new FieldSchema(null,
>    new Schema(
>      new FieldSchema(null, DataType.INTEGER)), DataType.BAG));
>
> Or do I have to explicitly define a tuple like so:
>
> return new Schema(
>  new FieldSchema(null,
>    new Schema(
>      new FieldSchema(null,
>        new Schema(
>          new FieldSchema(null, DataType.INTEGER)), DataType.TUPLE)),
> DataType.BAG));
>
> The docs seem pretty vague on this, and you're allowed to do either.
> My feeling would be that if the first form was illegal, you wouldn't
> be allowed to create a schema like that, but this may be wishful
> thinking.
>
> Thanks,
>
> Andrew.
>
> --
>
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>