Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/01/10 18:56:03 UTC

Holding onto info when doing a udf on a bag

So I have a udf, let's call it myudf.bag2bag, which takes a bag which
contains "prop," and creates a new bag of tuples based on that.

I have data in the form of

id    prop    other1    other2

If all I care about is running the udf, obviously I can do

A = LOAD 'file' AS (id, prop, other1, other2);
B = GROUP A BY id;
C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));

And all is fine

But what do I do if I want to hold on to the other data, especially if I
don't know how much of it there will be (from a bag2bag perspective)?

My thought is that in bag2bag, you can pass in a tuple of "extras," which
you then pass back, i.e.

C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop, (A.other1,
A.other2)));

I'm just not sure how I would specify the schema for this, in such a way
that any number of entries could be in the tuple, and then you could just
sort of reference them later.

Is this possible?

Re: Holding onto info when doing a udf on a bag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I think it's interesting to see what motivates different companies to choose
Pig, what issues they have encountered and how they solved them, the general
architecture, etc.

There are a few slide decks floating around the internet about how Pig is
being used in production at Yahoo, Twitter, LinkedIn, Mendeley, Meebo, and a
bunch of others; you can try looking at them for inspiration.

Curious what you mean when you say "serious data" :)

D

On Mon, Jan 10, 2011 at 5:41 PM, Jonathan Coveney <jc...@gmail.com>wrote:

> If we ever do anything really worth writing about, maybe I'll ask the higher
> ups if we can do a case study... I'm not sure what sort of use information
> would best benefit the Pig community, any thoughts?
>
> But I would love to give back, and show that Pig can handle some serious
> data.

Re: Holding onto info when doing a udf on a bag

Posted by Jonathan Coveney <jc...@gmail.com>.

If we ever do anything really worth writing about, maybe I'll ask the higher
ups if we can do a case study... I'm not sure what sort of use information
would best benefit the Pig community, any thoughts?

But I would love to give back, and show that Pig can handle some serious
data.

2011/1/10 Dmitriy Ryaboy <dv...@gmail.com>

> Absolutely.
> Would love to hear what you are doing once it goes in production by the
> way.
>
> D

Re: Holding onto info when doing a udf on a bag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Absolutely.
Would love to hear what you are doing once it goes in production by the way.

D

On Mon, Jan 10, 2011 at 2:59 PM, Jonathan Coveney <jc...@gmail.com>wrote:

> Thank you Julien.
>
> Once again I want to thank everyone for their help... I know that I use the
> listserv a lot, but you guys have really helped me turn Pig into a powerful
> tool in my workplace, and I know that Pig benefits from being used on large
> production systems.
>
> Jon

Re: Holding onto info when doing a udf on a bag

Posted by Jonathan Coveney <jc...@gmail.com>.

Thank you Julien.

Once again I want to thank everyone for their help... I know that I use the
listserv a lot, but you guys have really helped me turn Pig into a powerful
tool in my workplace, and I know that Pig benefits from being used on large
production systems.

Jon

2011/1/10 Julien Le Dem <le...@yahoo-inc.com>

> Hi Jonathan,
> It's input.getField(1).schema
> You can get the schema of your input by overriding Schema
> outputSchema(Schema) but it looks like you figured that out.
> outputSchema is called on the client side so if you want to make use of the
> input schema in exec(Tuple) you need to pass it in the UDF context:
> Properties properties =
> UDFContext.getUDFContext().getUDFProperties(this.getClass());
> properties.put("inputSchema", inputSchema);
> Julien

Re: Holding onto info when doing a udf on a bag

Posted by Julien Le Dem <le...@yahoo-inc.com>.

Hi Jonathan,
It's input.getField(1).schema
You can get the schema of your input by overriding Schema outputSchema(Schema) but it looks like you figured that out.
outputSchema is called on the client side so if you want to make use of the input schema in exec(Tuple) you need to pass it in the UDF context:
Properties properties = UDFContext.getUDFContext().getUDFProperties(this.getClass());
properties.put("inputSchema", inputSchema);
Julien
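
One caveat worth sketching, in plain Java with no Pig classes: the object in the snippet above is typed as Properties, and with java.util.Properties a value stored through put that is not a String is invisible to getProperty. Storing the schema in string form sidesteps that. SchemaHandOff, remember, and recall are hypothetical names, and the sketch does not demonstrate how the properties travel from the client to the tasks; it only shows the store and read-back steps.

```java
import java.util.Properties;

// Plain-Java sketch of handing a schema from outputSchema() to exec()
// through a Properties object. No Pig classes are used; names are hypothetical.
class SchemaHandOff {
    static final String KEY = "inputSchema";

    // Client side: remember the schema in string form.
    static void remember(Properties props, String schemaAsString) {
        props.setProperty(KEY, schemaAsString);
    }

    // Task side: read it back; null if nothing was stored as a String.
    static String recall(Properties props) {
        return props.getProperty(KEY);
    }
}
```

In a real UDF, remember would be called from outputSchema(Schema) and recall from exec(Tuple), with the Properties object obtained as in the lines above.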

On 1/10/11 1:25 PM, "Jonathan Coveney" <jc...@gmail.com> wrote:

I was able to get it to work (I just didn't override the schema), but I'd
rather it have the schema so that DESCRIBE and whatnot work.

Is there no way, given a Schema with fields, to get the Schema of one of
those fields? I can try to make a hack or something, but is there a reason
you can't do Schema inner = input.getSchema(1)? That is, instead of getField,
which returns a Schema.FieldSchema, a getSchema function that gives the
actual schema of the given field.

As always, I appreciate the help.

Re: Holding onto info when doing a udf on a bag

Posted by Jonathan Coveney <jc...@gmail.com>.

I was able to get it to work (I just didn't override the schema), but I'd
rather it have the schema so that DESCRIBE and whatnot work.

Is there no way, given a Schema with fields, to get the Schema of one of
those fields? I can try to make a hack or something, but is there a reason
you can't do Schema inner = input.getSchema(1)? That is, instead of getField,
which returns a Schema.FieldSchema, a getSchema function that gives the
actual schema of the given field.

As always, I appreciate the help.

2011/1/10 Jonathan Coveney <jc...@gmail.com>

> I was under the impression that for Bag->Bag functions, providing the
> schema made things much faster?

Re: Holding onto info when doing a udf on a bag

Posted by Jonathan Coveney <jc...@gmail.com>.

I was under the impression that for Bag->Bag functions, providing the schema
made things much faster?

2011/1/10 Dmitriy Ryaboy <dv...@gmail.com>

> Heck, if you know the schema at runtime, you could pass in a string
> describing the schema as another argument.
> Or pass it in during initialization:
>
> define udfWithSchema myUdf('a:int, b:chararray');
>
> What do you need the schema for, exactly?
>
> D

Re: Holding onto info when doing a udf on a bag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Heck, if you know the schema at runtime, you could pass in a string
describing the schema as another argument.
Or pass it in during initialization:

define udfWithSchema myUdf('a:int, b:chararray');

What do you need the schema for, exactly?

D
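
A rough illustration of the constructor route, in plain Java: Pig passes the string arguments of a DEFINE to the UDF's constructor, so the UDF can parse a schema description once when it is constructed. The SchemaSpec class and its hand-rolled name:type parser are hypothetical; a real UDF would more likely hand the string to Pig's own schema parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for a schema described as "name:type, name:type".
// Hand-rolled for illustration; this is not Pig's schema parser.
class SchemaSpec {
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Mirrors a UDF constructor that receives the DEFINE argument.
    SchemaSpec(String spec) {
        for (String part : spec.split(",")) {
            String[] nameAndType = part.trim().split(":");
            if (nameAndType.length != 2) {
                throw new IllegalArgumentException("Expected name:type, got: " + part);
            }
            fields.put(nameAndType[0].trim(), nameAndType[1].trim());
        }
    }

    // Field names mapped to type names, in declaration order.
    Map<String, String> fields() {
        return fields;
    }
}
```

With the DEFINE above, the constructor would receive the string 'a:int, b:chararray' and fields() would list a and b in that order.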

On Mon, Jan 10, 2011 at 10:36 AM, Jonathan Coveney <jc...@gmail.com>wrote:

> I thought about that, but I do not know how long the tuple is. This isn't an
> issue from a calculation perspective, I suppose, as long as you make sure
> that prop is the first thing in the bag. But from a schema...hmm, I guess
> you could just grab the schema of the other elements and build it
> accordingly?

Re: Holding onto info when doing a udf on a bag

Posted by Jonathan Coveney <jc...@gmail.com>.

I thought about that, but I do not know how long the tuple is. This isn't an
issue from a calculation perspective, I suppose, as long as you make sure
that prop is the first thing in the bag. But from a schema...hmm, I guess
you could just grab the schema of the other elements and build it
accordingly?

2011/1/10 Dmitriy Ryaboy <dv...@gmail.com>

> Jonathan, can't you just pass the bag A in?

Re: Holding onto info when doing a udf on a bag

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Jonathan, can't you just pass the bag A in?
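
One way to read this suggestion, sketched in plain Java rather than against the Pig API: the UDF receives every tuple of the grouped bag whole, expands prop (assumed here to sit at index 1, after id) into several output values, and copies the remaining fields through unchanged, so it never needs to know how many extras there are. BagPassThrough, expandProp, the comma-splitting logic, and the use of List in place of Pig's DataBag and Tuple are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a bag-to-bag UDF that keeps extra fields.
// A "bag" is a List of tuples; a "tuple" is a List<Object>.
// Hypothetical names throughout; this does not use the Pig API.
class BagPassThrough {
    static final int PROP_INDEX = 1; // assumed layout: (id, prop, other1, other2, ...)

    // Hypothetical per-prop logic: split a prop string on commas.
    static List<Object> expandProp(Object prop) {
        List<Object> out = new ArrayList<>();
        for (String piece : String.valueOf(prop).split(",")) {
            out.add(piece.trim());
        }
        return out;
    }

    // For each input tuple, emit one output tuple per expanded prop value,
    // carrying every other field along regardless of how many there are.
    static List<List<Object>> bag2bag(List<List<Object>> bag) {
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> tuple : bag) {
            for (Object value : expandProp(tuple.get(PROP_INDEX))) {
                List<Object> outTuple = new ArrayList<>(tuple);
                outTuple.set(PROP_INDEX, value);
                result.add(outTuple);
            }
        }
        return result;
    }
}
```

In the script this would correspond to calling the UDF on A rather than A.prop; the output schema then has to be derived from the input schema, which is what the other replies in this thread discuss.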

On Mon, Jan 10, 2011 at 9:56 AM, Jonathan Coveney <jc...@gmail.com>wrote:

> So I have a udf, let's call it myudf.bag2bag, which takes a bag which
> contains "prop," and creates a new bag of tuples based on that.
>
> I have data in the form of
>
> id    prop    other1    other2
>
> If all I care about is running the udf, obviously I can do
>
> A = LOAD 'file' AS (id, prop, other1, other2);
> B = GROUP A BY id;
> C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop));
>
> And all is fine
>
> But what do I do if I want to hold on to the other data, especially if I
> don't know how much of it there will be (from a bag2bag perspective)?
>
> My thought is that in bag2bag, you can pass in a tuple of "extras," which
> you then pass back, i.e.
>
> C = FOREACH B GENERATE group, FLATTEN(myudf.bag2bag(A.prop, (A.other1,
> A.other2)));
>
> I'm just not sure how I would specify the schema for this, in such a way
> that any number of entries could be in the tuple, and then you could just
> sort of reference them later.
>
> Is this possible?
>