Posted to user@pig.apache.org by Martin Goodson <ma...@qubitproducts.com> on 2012/11/14 17:17:44 UTC

Accessing tuple field names from within a python udf

I normally deal with very large tuples with many fields. It's a pain to deal
with these in Python UDFs since I can't figure out a way to pass schemas
into the UDF. I have to hard-code the column numbers in the UDFs, which is a
maintenance nightmare.

It seems that Java UDFs receive the full tuple in their exec methods so
that the correct fields can be identified, whereas Python UDFs only receive
list objects (with field names stripped). Is there any way to get the
behaviour of Python UDFs to conform to the Java behaviour?


Thanks for any ideas
Martin

Re: Accessing tuple field names from within a python udf

Posted by Jonathan Coveney <jc...@gmail.com>.

In the Java interface, there is a getInputSchema() method. You could make
this available on the Python side of things. This would be a useful
addition.


2012/11/16 Martin Goodson <ma...@qubitproducts.com>

> Unfortunately I've realised that boundscript.describe doesn't return a
> string. It returns void and prints to stdout instead. This means I have to
> go through a rather painful process of calling a separate Python process
> that calls boundscript.describe and then capturing the stdout of that
> process in order to obtain the schema. I don't know why it doesn't return
> a string. Maybe there is an easier way I am missing here. If people have
> any ideas for a more elegant solution, I would be happy to develop it and
> contribute the code.
>
> Martin

Re: Accessing tuple field names from within a python udf

Posted by Martin Goodson <ma...@qubitproducts.com>.

Unfortunately I've realised that boundscript.describe doesn't return a
string. It returns void and prints to stdout instead. This means I have to
go through a rather painful process of calling a separate Python process
that calls boundscript.describe and then capturing the stdout of that
process in order to obtain the schema. I don't know why it doesn't return
a string. Maybe there is an easier way I am missing here. If people have
any ideas for a more elegant solution, I would be happy to develop it and
contribute the code.

Martin
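
For readers following along, here is a minimal sketch of the "run a separate
process and capture its stdout" step described above. It assumes the parent is
an ordinary CPython 3 process. The child command is only a stand-in that prints
a made-up schema line; in practice it would be whatever command runs the
embedded script that calls describe(). The helper name capture_stdout and the
schema text are illustrative assumptions, not part of Pig.

```python
import subprocess
import sys


def capture_stdout(argv):
    """Run a command and return what it printed to stdout, stripped."""
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        check=True,               # raise if the child exits with an error
        universal_newlines=True,  # decode the output to str
    )
    return result.stdout.strip()


# Stand-in child process (assumption): it only prints a made-up schema line.
# A real child would be the script that calls describe().
child = [sys.executable, "-c", "print('A: {user_id: int,country: chararray}')"]
schema = capture_stdout(child)
```

The captured string can then be handed to the UDF like any other string
argument.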







On 15 November 2012 20:20, Jonathan Coveney <jc...@gmail.com> wrote:

> Martin,
>
> That is a reasonable workaround. Even in Java UDFs, you can't directly
> access fields by name. Tuples are indexed only by number. Using the Schema
> is how I would do it.

Re: Accessing tuple field names from within a python udf

Posted by Jonathan Coveney <jc...@gmail.com>.

Martin,

That is a reasonable workaround. Even in Java UDFs, you can't directly
access fields by name. Tuples are indexed only by number. Using the Schema
is how I would do it.
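
To illustrate the idea of going through the schema rather than hard-coded
positions, here is a small sketch in plain Python. It assumes the field names
are already available as a list in column order; the names make_accessor,
user_id, country and revenue are made up for the example. It avoids dict
comprehensions so that it should also be acceptable to older Jython versions.

```python
def make_accessor(field_names):
    """Return a function that reads a positional tuple by field name."""
    # Build the name -> position map once, not once per row.
    index = dict((name, i) for i, name in enumerate(field_names))

    def get(fields, name):
        return fields[index[name]]

    return get


# Hypothetical field names; in practice they would come from the schema.
get = make_accessor(["user_id", "country", "revenue"])
row = (42, "UK", 9.5)
country = get(row, "country")
```

If a column is added or moved, only the list of names changes; the UDF body
keeps referring to "country".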


2012/11/14 Martin Goodson <ma...@qubitproducts.com>

> Sorry to reply to my own post, but I've found a workaround that I
> thought I should put here:
>
> 1. Use embedded Pig.
> 2. Access the schema with boundscript.describe().
> 3. Pass the schema as a parameter into the UDF call.
>
> Thanks
> Martin

Re: Accessing tuple field names from within a python udf

Posted by Martin Goodson <ma...@qubitproducts.com>.

Sorry to reply to my own post, but I've found a workaround that I
thought I should put here:

1. Use embedded Pig.
2. Access the schema with boundscript.describe().
3. Pass the schema as a parameter into the UDF call.

Thanks
Martin
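
The last step of the workaround above might look roughly like the sketch
below, written in plain Python so it can be tried outside Pig. It assumes the
schema reaches the UDF as a string of the form "A: {name: type,name: type}"
and that the schema is flat (no nested tuples, bags or maps). The function
names and the example fields are hypothetical.

```python
import re


def field_names_from_schema(schema):
    """Extract field names from a flat schema string such as
    'A: {user_id: int,country: chararray,revenue: double}'.
    Nested tuples, bags and maps are not handled in this sketch."""
    match = re.search(r"\{(.*)\}", schema)
    if match is None:
        raise ValueError("no {...} body found in schema: %r" % schema)
    body = match.group(1)
    return [part.split(":")[0].strip()
            for part in body.split(",") if part.strip()]


def country_and_revenue(fields, schema):
    """Hypothetical UDF body: the schema arrives as an extra string argument."""
    names = field_names_from_schema(schema)
    index = dict((name, i) for i, name in enumerate(names))
    return "%s:%s" % (fields[index["country"]], fields[index["revenue"]])


schema = "A: {user_id: int,country: chararray,revenue: double}"
result = country_and_revenue((42, "UK", 9.5), schema)
```

Parsing the schema on every call is wasteful for large inputs; caching the
name-to-position map per schema string would be a natural refinement.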



