Posted to user@pig.apache.org by Kevin Weil <ke...@gmail.com> on 2008/10/29 19:12:01 UTC

Accessing schemas in UDFs

Hi,

With the new typing and schema work, will there be (or is there already) a
way to introspect the schema of a given tuple, for example in a UDF?  Since
we specify schemas and field names on load, and they are perpetuated
throughout the execution of a pig script, it seems reasonable for them to be
accessible to UDFs that get called mid-script.

Thanks, and I apologize in advance if this is already available and I've
missed it.

Kevin

Re: Accessing schemas in UDFs

Posted by pi song <pi...@gmail.com>.
Kevin,

I think this case is a bit weird. Generally, UDFs are expected to be
highly reusable, which means they should rely on data types rather than on
knowledge of the field names.
BTW, if you insist on doing this, one way I can think of is to use FOREACH ...
GENERATE to inject the field name into your UDF:

a = LOAD 'myfile' using PigStorage(',') as (v1: long, v2: chararray, v3: chararray);
a0 = FOREACH a GENERATE v1, v2, 'v2';
a1 = FOREACH a GENERATE v1, v3, 'v3';
b = FILTER a0 by MY_FUNC($0, $1, $2);
c = FILTER a1 by MY_FUNC($0, $1, $2);

Performance-wise, this should add only a little overhead.
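
A minimal sketch of what the UDF side of this workaround might look like. It is plain Java with an Object[] standing in for Pig's Tuple so that it compiles without Pig on the classpath. The class name, the argument check, and the per-field rule are all hypothetical and only illustrate branching on the injected name passed as the third argument of MY_FUNC($0, $1, $2).

```java
// Hypothetical stand-in for the MY_FUNC filter UDF. A real Pig UDF would read
// its arguments from a Pig Tuple; Object[] is used here only so the sketch
// compiles without Pig on the classpath.
final class NameAwareFilterSketch {

    // args mirrors MY_FUNC($0, $1, $2): args[1] is the value under test and
    // args[2] is the field name injected by FOREACH ... GENERATE.
    static boolean myFunc(Object[] args) {
        if (args == null || args.length != 3 || !(args[2] instanceof String)) {
            throw new IllegalArgumentException("expected (value, value, fieldName)");
        }
        Object value = args[1];
        String fieldName = (String) args[2];

        // Arbitrary example rule (an assumption, not from the thread):
        // v2 must be a non-empty string, any other field must be non-null.
        if ("v2".equals(fieldName)) {
            return value instanceof String && !((String) value).isEmpty();
        }
        return value != null;
    }

    public static void main(String[] unused) {
        System.out.println(myFunc(new Object[] {1L, "abc", "v2"}));
        System.out.println(myFunc(new Object[] {1L, "", "v2"}));
        System.out.println(myFunc(new Object[] {1L, "", "v3"}));
    }
}
```

The same function body serves both FILTER statements because the name travels with the data; the cost is one extra constant field per tuple.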

Pi

On Thu, Oct 30, 2008 at 8:28 AM, Kevin Weil <ke...@gmail.com> wrote:

> Santhosh,
>
> Great, that's about what I was looking for.  But it looks like the field
> names themselves are lost in DataType.determineFieldSchema, is that
> correct?  It always constructs a FieldSchema with null as the first
> parameter.  So if I said
>
> a = LOAD 'myfile' using PigStorage(',') as (v1: long, v2: chararray, v3: chararray);
> b = FILTER a by MY_FUNC(v1, v2);
> c = FILTER a by MY_FUNC(v1, v3);
>
> then in MY_FUNC, I'd have no way to tell which of the two calls I was
> getting, right?  I couldn't, for example, determine that in the first call,
> the second field of my input tuple was named v2, while in the second call,
> the second field was named v3?
>
> This is a totally contrived example -- what I'm really looking at is an
> extensible storage format for processed data that can use named fields
> without going outside of pig (in a custom StoreFunc).
>
> Thanks,
> Kevin

Re: Accessing schemas in UDFs

Posted by Kevin Weil <ke...@gmail.com>.
Santhosh,

Great, that's about what I was looking for.  But it looks like the field
names themselves are lost in DataType.determineFieldSchema, is that
correct?  It always constructs a FieldSchema with null as the first
parameter.  So if I said

a = LOAD 'myfile' using PigStorage(',') as (v1: long, v2: chararray, v3: chararray);
b = FILTER a by MY_FUNC(v1, v2);
c = FILTER a by MY_FUNC(v1, v3);

then in MY_FUNC, I'd have no way to tell which of the two calls I was
getting, right?  I couldn't, for example, determine that in the first call,
the second field of my input tuple was named v2, while in the second call,
the second field was named v3?
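
To make the concern concrete, here is a small self-contained sketch of why inspecting a value at run time can recover its type but not its name. It does not use Pig's classes: RuntimeSchemaSketch, FieldInfo, and describe are hypothetical stand-ins, and the type mapping is deliberately reduced to the two types used in the script above.

```java
// Hypothetical illustration only; not Pig's DataType or FieldSchema.
final class RuntimeSchemaSketch {

    // Simplified (alias, type) pair, loosely modelled on the idea of a
    // field schema.
    static final class FieldInfo {
        final String alias;
        final String type;

        FieldInfo(String alias, String type) {
            this.alias = alias;
            this.type = type;
        }

        @Override
        public String toString() {
            return alias + ": " + type;
        }
    }

    // Derives a description from the runtime value alone. The alias is
    // always null because a bare object does not know which field it
    // was loaded from.
    static FieldInfo describe(Object value) {
        String type;
        if (value instanceof Long) {
            type = "long";
        } else if (value instanceof String) {
            type = "chararray";
        } else {
            type = "unknown";
        }
        return new FieldInfo(null, type);
    }

    public static void main(String[] unused) {
        Object fromV2 = "abc"; // pretend this value came from field v2
        Object fromV3 = "abc"; // pretend this value came from field v3
        // Both descriptions are identical, so the two calls cannot be
        // told apart from the values alone.
        System.out.println(describe(fromV2));
        System.out.println(describe(fromV3));
    }
}
```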

This is a totally contrived example -- what I'm really looking at is an
extensible storage format for processed data that can use named fields
without going outside of pig (in a custom StoreFunc).

Thanks,
Kevin

On Wed, Oct 29, 2008 at 1:27 PM, Santhosh Srinivasan <sm...@yahoo-inc.com> wrote:

> At run time, the schema of each tuple can be determined using
> determineFieldSchema(Object o) in DataType.java
>
> http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/DataType.java
>
> The input schema for the UDF is passed onto the outputSchema method in
> the UDF.
>
> Santhosh

RE: Accessing schemas in UDFs

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
At run time, the schema of each tuple can be determined using
determineFieldSchema(Object o) in DataType.java

http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/DataType.java

The input schema for the UDF is passed to the outputSchema method of
the UDF.

Santhosh 
