Posted to user@pig.apache.org by Jacob Perkins <ja...@gmail.com> on 2011/02/01 05:46:40 UTC

Get ResourceSchema during putNext in StoreFunc

Trying to write a simple storefunc that makes use of the input data's
field names. Is there a way to gain access to this inside of the call to
putNext? Ostensibly you could set a variable with the schema during the
call to checkSchema, eg. in HBaseStorage, but as far as I can tell this
is null by the time putNext is called. Is there some other way or am I
missing something obvious?

--jacob
@thedatachef


Re: Get ResourceSchema during putNext in StoreFunc

Posted by Dan Harvey <da...@mendeley.com>.
Ah, I see. The null I was getting was while the map/reduce tasks were
running, because checkSchema is never called there.

I'll have a go at serialising the schema and sending it through the
config, which should be fine.

Thanks,

On 1 February 2011 16:28, jacob <ja...@gmail.com> wrote:
> Thanks, that's ultimately what I went with. (Saw how it was done in the
> AvroStorage class). Thought there might be a cleaner/simpler/better way
> I was missing.
>
> --jacob
> @thedatachef

-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

Re: Get ResourceSchema during putNext in StoreFunc

Posted by jacob <ja...@gmail.com>.
Thanks, that's ultimately what I went with. (Saw how it was done in the
AvroStorage class). Thought there might be a cleaner/simpler/better way
I was missing.

--jacob
@thedatachef

On Tue, 2011-02-01 at 21:22 +0530, Harsh J wrote:
> I remember facing this problem when trying to implement a Load/Store
> quite a while ago.
> 
> The issue (not really an issue, I guess) is that checkSchema is a
> front-end method: one that is called, perhaps multiple times, in
> Pig's front-end code. It isn't called by Pig's back-end code, which
> runs on a given platform (Local or Hadoop).
> 
> To persist your schema, make sure you put it onto the 'JobConf' (in loose
> terms). Pig lets you do this through the UDFContext class for UDFs.
> Get a UDFContext for your UDF, then set a property in it, with a key
> identifying your schema (or other data) and the value. Retrieve it
> the same way in the other methods, wherever you need it
> (getOutputFormat, putNext, etc.).

Re: Get ResourceSchema during putNext in StoreFunc

Posted by Harsh J <qw...@gmail.com>.
I remember facing this problem when trying to implement a Load/Store
quite a while ago.

The issue (not really an issue, I guess) is that checkSchema is a
front-end method: one that is called, perhaps multiple times, in
Pig's front-end code. It isn't called by Pig's back-end code, which
runs on a given platform (Local or Hadoop).

To persist your schema, make sure you put it onto the 'JobConf' (in loose
terms). Pig lets you do this through the UDFContext class for UDFs.
Get a UDFContext for your UDF, then set a property in it, with a key
identifying your schema (or other data) and the value. Retrieve it
the same way in the other methods, wherever you need it
(getOutputFormat, putNext, etc.).
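
A rough sketch of that hand-off, for anyone following along. It is not
Pig code: it uses only java.util.Properties as a stand-in for the per-UDF
properties that UDFContext gives you, and the class names, the key and
the ship() helper are all made up for illustration. In a real StoreFunc
the Properties object would, as far as I know, come from
UDFContext.getUDFContext().getUDFProperties(...), and Pig would do the
shipping to the tasks for you.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Stand-in for a StoreFunc. Every name here is hypothetical, not Pig's API. */
class SketchStore {
    static final String SCHEMA_KEY = "sketch.store.schema"; // made-up key

    private final Properties udfProps; // plays the role of the per-UDF properties
    private String[] fieldNames;       // instance state: not shared between instances

    SketchStore(Properties udfProps) {
        this.udfProps = udfProps;
    }

    /** Front end: persist the field names instead of only caching them. */
    void checkSchema(String[] names) {
        this.fieldNames = names; // a back-end instance never sees this field
        // Assumes field names contain no commas; a real schema needs a proper format.
        udfProps.setProperty(SCHEMA_KEY, String.join(",", names));
    }

    /** Back end: lazily recover the field names, then label each value. */
    String putNext(Object[] tuple) {
        if (fieldNames == null) {
            String stored = udfProps.getProperty(SCHEMA_KEY);
            if (stored == null) {
                throw new IllegalStateException("schema was never stored");
            }
            fieldNames = stored.split(",");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tuple.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(fieldNames[i]).append('=').append(tuple[i]);
        }
        return sb.toString();
    }
}

class SchemaHandoffDemo {
    /** Simulates sending the properties to a task by round-tripping them as text. */
    static Properties ship(Properties p) {
        try {
            StringWriter out = new StringWriter();
            p.store(out, null);
            Properties copy = new Properties();
            copy.load(new StringReader(out.toString()));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties frontEndProps = new Properties();
        new SketchStore(frontEndProps).checkSchema(new String[] {"id", "name"});

        // A fresh instance, as a task would create: its fieldNames field starts null.
        SketchStore task = new SketchStore(ship(frontEndProps));
        System.out.println(task.putNext(new Object[] {1, "pig"})); // id=1 name=pig
    }
}
```

The point is only that the second instance never had checkSchema called
on it, yet can still label its fields because the schema travelled as a
string property rather than as an instance variable.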

On Tue, Feb 1, 2011 at 10:16 AM, Jacob Perkins
<ja...@gmail.com> wrote:
> Trying to write a simple storefunc that makes use of the input data's
> field names. Is there a way to gain access to this inside of the call to
> putNext? Ostensibly you could set a variable with the schema during the
> call to checkSchema, eg. in HBaseStorage, but as far as I can tell this
> is null by the time putNext is called. Is there some other way or am I
> missing something obvious?
>
> --jacob
> @thedatachef
>
>



-- 
Harsh J
www.harshj.com

Re: Get ResourceSchema during putNext in StoreFunc

Posted by Jacob Perkins <ja...@gmail.com>.
So I don't get null when I read the schema in the checkSchema method. I
set the class's internal schema variable, as in your gist, and it's not
null in the very next call to 'setStoreLocation'. However, it is null on
all later calls to 'setStoreLocation' and any and all calls to putNext.
Not entirely sure when it goes out of scope.

Now, if this was vanilla map-reduce I'd say that checkSchema is being
called once during the initial map-reduce job setup phase and anything
you do in there is not going to be accessible to your later tasks which
are happening on many different machines in the cluster.

You could set the schema with checkSchema and then on the FIRST call to
setStoreLocation you could place the schema in the job's configuration
as a string. What I'm not sure about is exactly how many times
setStoreLocation is actually called. I suspect (any Pig devs wanna help
me out here?) that it's called exactly once per task (ie. during the
call to 'setup()' in vanilla map-reduce land). If that's true then all
you'd have to do is set it the first time then read it on all subsequent
calls to setStoreLocation. Could try it out at least...
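
Roughly what I have in mind, as a plain-Java sketch. A HashMap stands in
for the job's configuration, and the class, keys and schema string are
made up for illustration, so this says nothing about how often Pig
really calls setStoreLocation. It only shows a version of the idea that
does not care about the call count: publish the schema whenever this
instance has one, otherwise read it back.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical store function; the Map stands in for the job configuration. */
class LocationAwareStore {
    static final String SCHEMA_KEY = "sketch.schema";     // made-up key
    static final String LOCATION_KEY = "sketch.location"; // made-up key

    private String schema; // only ever set directly on the front end

    /** Front end only: remember the schema as a string. */
    void checkSchema(String schemaString) {
        this.schema = schemaString;
    }

    /** May run any number of times, on the front end or inside a task. */
    void setStoreLocation(String location, Map<String, String> conf) {
        conf.put(LOCATION_KEY, location);
        if (schema != null) {
            conf.put(SCHEMA_KEY, schema);  // this instance has it: publish
        } else {
            schema = conf.get(SCHEMA_KEY); // this instance lacks it: recover
        }
    }

    String schema() {
        return schema;
    }
}

class LocationAwareStoreDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        LocationAwareStore frontEnd = new LocationAwareStore();
        frontEnd.checkSchema("id:int,name:chararray");
        frontEnd.setStoreLocation("/tmp/out", conf);
        frontEnd.setStoreLocation("/tmp/out", conf); // repeated calls are harmless

        // A task gets its own instance and its own copy of the configuration.
        LocationAwareStore task = new LocationAwareStore();
        task.setStoreLocation("/tmp/out", new HashMap<>(conf));
        System.out.println(task.schema());
    }
}
```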

--jacob
@thedatachef

On Tue, 2011-02-01 at 15:23 +0000, Dan Harvey wrote:
> This is the same problem I was getting. I've put a snippet of the code
> I was using here: https://gist.github.com/804551
> 
> With this I get null whenever I try to read the ResourceSchema object
> in the checkSchema() method.
> 
> I've had a look over the AvroStorage and it seems to assume the
> ResourceSchema won't be null at this point in time so I'm not sure
> what's going on for me.
> Does anyone know if this is the best way to get the schema, or if pig
> will ever send a null schema to the checkSchema method?
> 
> Thanks,

Re: Get ResourceSchema during putNext in StoreFunc

Posted by Dan Harvey <da...@mendeley.com>.
This is the same problem I was getting. I've put a snippet of the code
I was using here: https://gist.github.com/804551

With this I get null whenever I try to read the ResourceSchema object
in the checkSchema() method.

I've had a look over the AvroStorage and it seems to assume the
ResourceSchema won't be null at this point in time so I'm not sure
what's going on for me.
Does anyone know if this is the best way to get the schema, or if pig
will ever send a null schema to the checkSchema method?

Thanks,

On 1 February 2011 04:46, Jacob Perkins <ja...@gmail.com> wrote:
>
> Trying to write a simple storefunc that makes use of the input data's
> field names. Is there a way to gain access to this inside of the call to
> putNext? Ostensibly you could set a variable with the schema during the
> call to checkSchema, eg. in HBaseStorage, but as far as I can tell this
> is null by the time putNext is called. Is there some other way or am I
> missing something obvious?
>
> --jacob
> @thedatachef
>


