Posted to user@pig.apache.org by Scott Carey <sc...@richrelevance.com> on 2010/06/02 11:23:01 UTC

Re: Pig loader 0.6 to 0.7 migration guide

So, here are some things I'm struggling with now:

In a LoadFunc, I want to load something into the DistributedCache.  The path is passed into the LoadFunc constructor as an argument.
The documentation for getSchema() and all the other metadata methods states that you can't modify the job or its configuration passed in.  I've verified that changes to the Configuration are ignored if set there.

It appears that I could set these properties in setLocation(), but that is called a lot on the back-end too, and the documentation does not state whether setLocation() is called on the front-end at all.  Based on my experimental results, it doesn't seem to be.
Is there no way to modify Hadoop properties on the front-end to utilize Hadoop features?  UDFContext seems completely useless for setting Hadoop properties for anything other than the UDF itself -- like distributed cache settings.  A stand-alone front-end hook for this would be great.  Otherwise, any hack that works would be acceptable for now.
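
For concreteness, here is a minimal sketch of the kind of loader I mean (the class, path, and symlink names are made up, and this is only an illustration of the attempt, not a recommended pattern):

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.builtin.PigStorage;

// A loader handed a side-file path in its constructor that wants to register it in
// the DistributedCache somewhere on the front-end.
public class SideFileLoader extends PigStorage {
    private final String sideFilePath;   // e.g. an HDFS path passed from the script

    public SideFileLoader(String sideFilePath) {
        this.sideFilePath = sideFilePath;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        super.setLocation(location, job);
        // The only obvious place left to touch the configuration, but it also runs
        // on the back-end, and it is unclear whether it runs on the front-end at all.
        DistributedCache.addCacheFile(
                URI.create(sideFilePath + "#sideFile"), job.getConfiguration());
        DistributedCache.createSymlink(job.getConfiguration());
    }
}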


* The documentation for LoadMetadata could use some information about when each method gets called -- front-end only?  Between which other calls?
* UDFContext's documentation needs help too:
** addJobConf() is public, but not expected to be used by end users, right?  Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses (a sketch of the typical usage follows).
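
The sort of example I have in mind, for illustration only (the class, property key, and signature string here are hypothetical): state written to UDFContext on the front-end and read back on the back-end.

import java.util.Properties;

import org.apache.pig.impl.util.UDFContext;

// Hypothetical helper showing the usual UDFContext handoff; the signature would
// normally arrive via LoadFunc.setUDFContextSignature().
public class MyLoaderContextExample {
    private final String udfSignature = "myLoader-1";

    private Properties props() {
        return UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { udfSignature });
    }

    // front-end (e.g. from pushProjection()): remember what was requested
    public void saveRequiredColumns(String csvIndexes) {
        props().setProperty("my.loader.required.columns", csvIndexes);
    }

    // back-end (e.g. from prepareToRead() or getNext()): read it back
    public String[] loadRequiredColumns() {
        return props().getProperty("my.loader.required.columns").split(",");
    }
}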


On May 24, 2010, at 11:06 AM, Alan Gates wrote:

> Scott,
> 
> I made an effort to address the documentation in https://issues.apache.org/jira/browse/PIG-1370 
>  If you have a chance take a look and let me know if it deals with  
> the issues you have or if more work is needed.
> 
> Alan.
> 
> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
> 
>> I have been using these documents for a couple weeks, implementing  
>> various store and load functionality, and they have been very helpful.
>> 
>> However, there is room for improvement.  What is most unclear is  
>> when the API methods get called.  Each method should clearly state  
>> in these documents (and the javadoc) when it is called -- front-end  
>> only? back-end only?  both?  Sometimes this is obvious, other times  
>> it is not.
>> For example, without looking at the source code its not possible to  
>> tell or infer if pushProjection() is called on the front-end or back- 
>> end, or both.  It could be implemented by being called on the front- 
>> end, expecting the loader implementation to persist necessary state  
>> to UDFContext for the back-end, or be called only on the back-end,  
>> or both.  One has to look at PigStorage source to see that it  
>> persists the pushProjection information into UDFContext, so its  
>> _probably_ only called on the front-end.
>> 
>> There are also a few types that these interfaces return or are  
>> provided that are completely undocumented.  I had to look at the  
>> source code to figure out what ResourceStatistics does, and how  
>> ResourceSchema should be used.  RequiredField, RequiredFieldList,  
>> and RequiredFieldResponse are all poorly documented aspects of a  
>> public interface.
>> 
>> 
>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>> 
>>> To add to this, there is also a how-to document on how to go about
>>> writing load/store functions from scratch in Pig 0.7 at
>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>> 
>>> Pradeep
>>> 
>>> -----Original Message-----
>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>> Sent: Friday, May 21, 2010 11:33 AM
>>> To: pig-user@hadoop.apache.org
>>> Cc: Eli Collins
>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>> 
>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I might
>>> be remembering incorrectly) asked if there was a migration guide for
>>> moving Pig load and store functions from 0.6 to 0.7.  I said there  
>>> was
>>> but I couldn't remember if it had been posted yet or not.  In fact it
>>> had already been posted to
>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>> .  Also, you can find the list of all incompatible changes for 0.7 at
>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>> .  Sorry, I should have included those links in my original slides.
>>> 
>>> Alan.
>> 
> 


Re: Pig loader 0.6 to 0.7 migration guide

Posted by Alan Gates <ga...@yahoo-inc.com>.
I've created https://issues.apache.org/jira/browse/PIG-1459 to capture  
the need for a standard serialization method.

Regarding RequiredFieldList, it is the last option: the position in the list is the requested destination, and the index refers to the position in the loader's original schema.  I believe the name was included since some loaders may think in terms of names instead of positions.  I created https://issues.apache.org/jira/browse/PIG-1460 to fix the documentation on this.
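
To spell that out with a purely illustrative sketch (none of these names or choices are prescribed by the interface; persisting the result is up to the loader):

import java.util.ArrayList;
import java.util.List;

import org.apache.pig.LoadPushDown.RequiredField;
import org.apache.pig.LoadPushDown.RequiredFieldList;
import org.apache.pig.LoadPushDown.RequiredFieldResponse;
import org.apache.pig.impl.logicalLayer.FrontendException;

// Under the reading above: getIndex() points into the loader's original schema,
// getAlias() optionally carries the name, and the order of the list is the order
// of the projected output.
public class ProjectionExample {
    public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList)
            throws FrontendException {
        List<Integer> keep = new ArrayList<Integer>();
        for (RequiredField f : requiredFieldList.getFields()) {
            keep.add(f.getIndex());      // index into the original schema
            // f.getAlias() may also be set, for loaders that think in names
        }
        // ... persist 'keep' (e.g. via UDFContext) for the back-end copy ...
        return new RequiredFieldResponse(true);   // true: the projection will be honored
    }
}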

Alan.


On Jun 15, 2010, at 12:36 PM, Dmitriy Ryaboy wrote:

> This is a good point and I don't want it to fall off the radar.
> Hoping someone can answer the RequiredFieldList question.
>
> -D
>
> On Thu, Jun 10, 2010 at 2:56 PM, Scott Carey  
> <sc...@richrelevance.com>wrote:
>
>> I wish there was better documentation on that too.
>>
>> Looking at the PigStorage code, it serializes an array of Booleans  
>> via
>> UDFContext to the backend.
>>
>> It would be significantly better if Pig serialized the requested  
>> fields for
>> us, provided that pushProjection returned a code that indicated  
>> that the
>> projection would be supported.
>>
>> Forcing users to do that serialization themselves is bug prone,  
>> especially
>> in the presence of nested schemas.
>>
>> The documentation is also poor when it comes to describing what the
>> RequiredFieldList even is.
>>
>> It has a name and an index field.   The code itself seems to allow  
>> for
>> either of these to be filled.  What do they mean?
>>
>> Is it:
>> the schema returned by the loader is:
>> (id: int, name: chararray, department: chararray)
>>
>> The RequiredFieldList is [ ("department", 1) , ("id", 0) ]
>>
>> What does that mean?
>> * The name is the field name requested, and the index is the  
>> location it
>> should be in the result?  so return (id: int, department: chararray)?
>> * The index is the index in the source schema, and the name is for
>> renaming, so return (department: chararray, id: int) (where the  
>> data in
>> department is actualy that from the original's name field)?
>> * The location in the RequiredFieldList array is the 'destination'
>> requested, the name is optional (if the schema had one) and the  
>> index is the
>> location in the original schema.  so the above RequiredFieldList is  
>> actually
>> impossible, since "department" is always index 2.
>>
>> I think it is the last one, but the first idea might be it too.   
>> Either way
>> the javadoc and other documentation does not describe what the  
>> meanings of
>> these values are nor what their possible ranges might be.
>>
>> On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:
>>
>>> I'm trying to figure out how exactly to appropriately implement the
>>> LoadPushDown interface in my LoadFunc implementation. I need to take
>>> the list of column aliases and pass that from the
>>> LoadPushDown.pushProjection(RequiredFieldList) function to make it
>>> available in the getTuple function. I'm kind of new to this so  
>>> forgive
>>> me if this is obvious. From my readings of the mailing list it  
>>> appears
>>> that the pushProjection function is called in the front-end where as
>>> the getTuple function is called in the back-end. How does a LoanFunc
>>> pass information from the front to the back end instances?
>>>
>>> regards, Andrew
>>>
>>> On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <ga...@yahoo-inc.com>
>> wrote:
>>>> A similar need is being expressed by zebra folks here -
>> https://issues.apache.org/jira/browse/PIG-1337.
>>>> You might want to comment/vote on it as it is scheduled for 0.8  
>>>> release.
>>>>
>>>> Loading data in prepareToRead() is fine. For a workaround I think  
>>>> it
>> should be ok to read the data directly from HDFS in each of the  
>> mappers
>> provided you aren't doing any costly namespace operations like  
>> 'listStatus'
>> that can stress the namesystem in the event of thousands of tasks  
>> executing
>> it concurrently.
>>>>
>>>> Regards
>>>> -@nkur
>>>>
>>>> 6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>>>>
>>>>
>>>>
>>>> On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:
>>>>
>>>>> Scott,
>>>>>      You can set hadoop properties at the time of running your pig
>> script with -D option. So
>>>>> pig -Dhadoop.property.name=something myscript essentially sets the
>> property in the job configuration.
>>>>>
>>>>
>>>> So no programatic configuration of hadoop properties is allowed  
>>>> (where
>> its easier to control) but its allowable to set it at the script  
>> level?  I
>> guess I can do that, but it complicates things.
>>>> Also this is a very poor way to do this.  My script has 600 lines  
>>>> of Pig
>> and ~45 M/R jobs.  Only three of the jobs need the distributed  
>> cache, not
>> all 45.
>>>>
>>>>> Speaking specifically of utilizing the distributed cache  
>>>>> feature, you
>> can just set the filename in LoadFunc constructor and then load the  
>> data in
>> memory in getNext() method if not already loaded.
>>>>>
>>>>
>>>> That is what the original idea was.
>>>>
>>>>> Here is the pig command to set up the distributed cache
>>>>>
>>>>> pig
>> -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/ 
>> distributed-cache#file-name
>>  ---> This name needs to be passed to UDF constructor so that its  
>> available
>> in mapper/reducer's working dir on compute node.
>>>>>      -Dmapred.create.symlink=yes
>>>>>      script.pig
>>>>
>>>> If that property is set, then constructor only needs file-name (the
>> symlink) right?  Right now I'm trying to set those properties using  
>> the
>> DistributedCache static interfaces which means I need to have  
>> access to the
>> full path.
>>>>
>>>>>
>>>>> Implement something like a loadData() method that loads the data  
>>>>> only
>> once and invoke it from getNext() method. The script will work even  
>> in the
>> local mode if the file distributed via distributed cache resides in  
>> the CWD
>> from where script is invoked.
>>>>>
>>>>
>>>> I'm loading the data in prepareToRead(), which seems most  
>>>> appropriate.
>> Do you see any problem with that?
>>>>
>>>>> Hope that's helpful.
>>>>
>>>> I think the command line property hack is insufficient.  I am  
>>>> left with
>> a choice of having a couple jobs read the file from HDFS directly  
>> in their
>> mappers, or having all jobs unnecessarily set up distributed  
>> cache.  Job
>> setup time is already 1/4 of my processing time.
>>>> Is there a feature request for Load/Store access to Hadoop job
>> configuration properties?
>>>>
>>>> Ideally, this would be a method on LoadFunc that passes a  
>>>> modifiable
>> Configuration object in on the front-end, or a callback for a user to
>> optionally provide a Configuration object with the few properties  
>> you want
>> to alter in it that Pig can apply to the real thing before it  
>> configures its
>> properties.
>>>>
>>>> Thanks for the info Ankur,
>>>>
>>>> -Scott
>>>>
>>>>>
>>>>> -@nkur
>>>>>
>>>>> On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>>>>>
>>>>> So, here are some things I'm struggling with now:
>>>>>
>>>>> In a LoadFunc, If I want to load something into  
>>>>> DistributedCache.  The
>> path is passed into the LoadFunc constructor as an argument.
>>>>> Documentation on getSchema() and all other metadata methods  
>>>>> state that
>> you can't modify the job or its configuration passed in.  I've  
>> verified that
>> changes to the Configuration are ignored if set here.
>>>>>
>>>>> It appears that I could set these properties in setLocation()  
>>>>> but that
>> is called a lot on the back-end too, and the documentation does not  
>> state if
>> setLocation() is called at all on the front-end.  Based on my  
>> experimental
>> results, it doesn't seem to.
>>>>> Is there no way to modify Hadoop properties on the front-end to  
>>>>> utilize
>> hadoop features?  UDFContext seems completely useless for setting  
>> hadoop
>> properties for things other than the UDF itself -- like distributed  
>> cache
>> settings.  A stand-alone front-end hook for this would be great.   
>> Otherwise,
>> any hack that works would be acceptable for now.
>>>>>
>>>>>
>>>>> * The documentation for LoadMetadata can use some information  
>>>>> about
>> when each method gets called -- front end only?  Between what other  
>> calls?
>>>>> * UDFContext's documentation needs help too --
>>>>> ** addJobConf() is public, but not expected to be used by end- 
>>>>> users,
>> right?  Several public methods here look like they need better
>> documentation, and the class itself could use a javadoc entry with  
>> some
>> example uses.
>>>>>
>>>>>
>>>>> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>>>>>
>>>>>> Scott,
>>>>>>
>>>>>> I made an effort to address the documentation in
>> https://issues.apache.org/jira/browse/PIG-1370
>>>>>> If you have a chance take a look and let me know if it deals with
>>>>>> the issues you have or if more work is needed.
>>>>>>
>>>>>> Alan.
>>>>>>
>>>>>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>>>>>
>>>>>>> I have been using these documents for a couple weeks,  
>>>>>>> implementing
>>>>>>> various store and load functionality, and they have been very
>> helpful.
>>>>>>>
>>>>>>> However, there is room for improvement.  What is most unclear is
>>>>>>> when the API methods get called.  Each method should clearly  
>>>>>>> state
>>>>>>> in these documents (and the javadoc) when it is called --  
>>>>>>> front-end
>>>>>>> only? back-end only?  both?  Sometimes this is obvious, other  
>>>>>>> times
>>>>>>> it is not.
>>>>>>> For example, without looking at the source code its not  
>>>>>>> possible to
>>>>>>> tell or infer if pushProjection() is called on the front-end  
>>>>>>> or back-
>>>>>>> end, or both.  It could be implemented by being called on the  
>>>>>>> front-
>>>>>>> end, expecting the loader implementation to persist necessary  
>>>>>>> state
>>>>>>> to UDFContext for the back-end, or be called only on the back- 
>>>>>>> end,
>>>>>>> or both.  One has to look at PigStorage source to see that it
>>>>>>> persists the pushProjection information into UDFContext, so its
>>>>>>> _probably_ only called on the front-end.
>>>>>>>
>>>>>>> There are also a few types that these interfaces return or are
>>>>>>> provided that are completely undocumented.  I had to look at the
>>>>>>> source code to figure out what ResourceStatistics does, and how
>>>>>>> ResourceSchema should be used.  RequiredField,  
>>>>>>> RequiredFieldList,
>>>>>>> and RequiredFieldResponse are all poorly documented aspects of a
>>>>>>> public interface.
>>>>>>>
>>>>>>>
>>>>>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>>>>>
>>>>>>>> To add to this, there is also a how-to document on how to go  
>>>>>>>> about
>>>>>>>> writing load/store functions from scratch in Pig 0.7 at
>>>>>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>>>>>
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>>>>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>>>>>> To: pig-user@hadoop.apache.org
>>>>>>>> Cc: Eli Collins
>>>>>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>>>>>
>>>>>>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I
>> might
>>>>>>>> be remembering incorrectly) asked if there was a migration  
>>>>>>>> guide for
>>>>>>>> moving Pig load and store functions from 0.6 to 0.7.  I said  
>>>>>>>> there
>>>>>>>> was
>>>>>>>> but I couldn't remember if it had been posted yet or not.  In  
>>>>>>>> fact
>> it
>>>>>>>> had already been posted to
>>>>>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>>>>>>> .  Also, you can find the list of all incompatible changes  
>>>>>>>> for 0.7
>> at
>>>>>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>>>>>>> .  Sorry, I should have included those links in my original  
>>>>>>>> slides.
>>>>>>>>
>>>>>>>> Alan.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>
>>


Re: Pig loader 0.6 to 0.7 migration guide

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
This is a good point and I don't want it to fall off the radar.
Hoping someone can answer the RequiredFieldList question.

-D

On Thu, Jun 10, 2010 at 2:56 PM, Scott Carey <sc...@richrelevance.com>wrote:

> I wish there was better documentation on that too.
>
> Looking at the PigStorage code, it serializes an array of Booleans via
> UDFContext to the backend.
>
> It would be significantly better if Pig serialized the requested fields for
> us, provided that pushProjection returned a code that indicated that the
> projection would be supported.
>
> Forcing users to do that serialization themselves is bug prone, especially
> in the presence of nested schemas.
>
> The documentation is also poor when it comes to describing what the
> RequiredFieldList even is.
>
> It has a name and an index field.   The code itself seems to allow for
> either of these to be filled.  What do they mean?
>
> Is it:
> the schema returned by the loader is:
>  (id: int, name: chararray, department: chararray)
>
> The RequiredFieldList is [ ("department", 1) , ("id", 0) ]
>
> What does that mean?
> * The name is the field name requested, and the index is the location it
> should be in the result?  so return (id: int, department: chararray)?
> * The index is the index in the source schema, and the name is for
> renaming, so return (department: chararray, id: int) (where the data in
> department is actualy that from the original's name field)?
> * The location in the RequiredFieldList array is the 'destination'
> requested, the name is optional (if the schema had one) and the index is the
> location in the original schema.  so the above RequiredFieldList is actually
> impossible, since "department" is always index 2.
>
> I think it is the last one, but the first idea might be it too.  Either way
> the javadoc and other documentation does not describe what the meanings of
> these values are nor what their possible ranges might be.
>
> On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:
>
> > I'm trying to figure out how exactly to appropriately implement the
> > LoadPushDown interface in my LoadFunc implementation. I need to take
> > the list of column aliases and pass that from the
> > LoadPushDown.pushProjection(RequiredFieldList) function to make it
> > available in the getTuple function. I'm kind of new to this so forgive
> > me if this is obvious. From my readings of the mailing list it appears
> > that the pushProjection function is called in the front-end where as
> > the getTuple function is called in the back-end. How does a LoanFunc
> > pass information from the front to the back end instances?
> >
> > regards, Andrew
> >
> > On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <ga...@yahoo-inc.com>
> wrote:
> >> A similar need is being expressed by zebra folks here -
> https://issues.apache.org/jira/browse/PIG-1337.
> >> You might want to comment/vote on it as it is scheduled for 0.8 release.
> >>
> >> Loading data in prepareToRead() is fine. For a workaround I think it
> should be ok to read the data directly from HDFS in each of the mappers
> provided you aren't doing any costly namespace operations like 'listStatus'
> that can stress the namesystem in the event of thousands of tasks executing
> it concurrently.
> >>
> >> Regards
> >> -@nkur
> >>
> >>  6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
> >>
> >>
> >>
> >> On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:
> >>
> >>> Scott,
> >>>       You can set hadoop properties at the time of running your pig
> script with -D option. So
> >>> pig -Dhadoop.property.name=something myscript essentially sets the
> property in the job configuration.
> >>>
> >>
> >> So no programatic configuration of hadoop properties is allowed (where
> its easier to control) but its allowable to set it at the script level?  I
> guess I can do that, but it complicates things.
> >> Also this is a very poor way to do this.  My script has 600 lines of Pig
> and ~45 M/R jobs.  Only three of the jobs need the distributed cache, not
> all 45.
> >>
> >>> Speaking specifically of utilizing the distributed cache feature, you
> can just set the filename in LoadFunc constructor and then load the data in
> memory in getNext() method if not already loaded.
> >>>
> >>
> >> That is what the original idea was.
> >>
> >>> Here is the pig command to set up the distributed cache
> >>>
> >>> pig
> -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name
>   ---> This name needs to be passed to UDF constructor so that its available
> in mapper/reducer's working dir on compute node.
> >>>       -Dmapred.create.symlink=yes
> >>>       script.pig
> >>
> >> If that property is set, then constructor only needs file-name (the
> symlink) right?  Right now I'm trying to set those properties using the
> DistributedCache static interfaces which means I need to have access to the
> full path.
> >>
> >>>
> >>> Implement something like a loadData() method that loads the data only
> once and invoke it from getNext() method. The script will work even in the
> local mode if the file distributed via distributed cache resides in the CWD
> from where script is invoked.
> >>>
> >>
> >> I'm loading the data in prepareToRead(), which seems most appropriate.
>  Do you see any problem with that?
> >>
> >>> Hope that's helpful.
> >>
> >> I think the command line property hack is insufficient.  I am left with
> a choice of having a couple jobs read the file from HDFS directly in their
> mappers, or having all jobs unnecessarily set up distributed cache.  Job
> setup time is already 1/4 of my processing time.
> >> Is there a feature request for Load/Store access to Hadoop job
> configuration properties?
> >>
> >> Ideally, this would be a method on LoadFunc that passes a modifiable
> Configuration object in on the front-end, or a callback for a user to
> optionally provide a Configuration object with the few properties you want
> to alter in it that Pig can apply to the real thing before it configures its
> properties.
> >>
> >> Thanks for the info Ankur,
> >>
> >> -Scott
> >>
> >>>
> >>> -@nkur
> >>>
> >>> On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
> >>>
> >>> So, here are some things I'm struggling with now:
> >>>
> >>> In a LoadFunc, If I want to load something into DistributedCache.  The
> path is passed into the LoadFunc constructor as an argument.
> >>> Documentation on getSchema() and all other metadata methods state that
> you can't modify the job or its configuration passed in.  I've verified that
> changes to the Configuration are ignored if set here.
> >>>
> >>> It appears that I could set these properties in setLocation() but that
> is called a lot on the back-end too, and the documentation does not state if
> setLocation() is called at all on the front-end.  Based on my experimental
> results, it doesn't seem to.
> >>> Is there no way to modify Hadoop properties on the front-end to utilize
> hadoop features?  UDFContext seems completely useless for setting hadoop
> properties for things other than the UDF itself -- like distributed cache
> settings.  A stand-alone front-end hook for this would be great.  Otherwise,
> any hack that works would be acceptable for now.
> >>>
> >>>
> >>> * The documentation for LoadMetadata can use some information about
> when each method gets called -- front end only?  Between what other calls?
> >>> * UDFContext's documentation needs help too --
> >>> ** addJobConf() is public, but not expected to be used by end-users,
> right?  Several public methods here look like they need better
> documentation, and the class itself could use a javadoc entry with some
> example uses.
> >>>
> >>>
> >>> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
> >>>
> >>>> Scott,
> >>>>
> >>>> I made an effort to address the documentation in
> https://issues.apache.org/jira/browse/PIG-1370
> >>>> If you have a chance take a look and let me know if it deals with
> >>>> the issues you have or if more work is needed.
> >>>>
> >>>> Alan.
> >>>>
> >>>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
> >>>>
> >>>>> I have been using these documents for a couple weeks, implementing
> >>>>> various store and load functionality, and they have been very
> helpful.
> >>>>>
> >>>>> However, there is room for improvement.  What is most unclear is
> >>>>> when the API methods get called.  Each method should clearly state
> >>>>> in these documents (and the javadoc) when it is called -- front-end
> >>>>> only? back-end only?  both?  Sometimes this is obvious, other times
> >>>>> it is not.
> >>>>> For example, without looking at the source code its not possible to
> >>>>> tell or infer if pushProjection() is called on the front-end or back-
> >>>>> end, or both.  It could be implemented by being called on the front-
> >>>>> end, expecting the loader implementation to persist necessary state
> >>>>> to UDFContext for the back-end, or be called only on the back-end,
> >>>>> or both.  One has to look at PigStorage source to see that it
> >>>>> persists the pushProjection information into UDFContext, so its
> >>>>> _probably_ only called on the front-end.
> >>>>>
> >>>>> There are also a few types that these interfaces return or are
> >>>>> provided that are completely undocumented.  I had to look at the
> >>>>> source code to figure out what ResourceStatistics does, and how
> >>>>> ResourceSchema should be used.  RequiredField, RequiredFieldList,
> >>>>> and RequiredFieldResponse are all poorly documented aspects of a
> >>>>> public interface.
> >>>>>
> >>>>>
> >>>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
> >>>>>
> >>>>>> To add to this, there is also a how-to document on how to go about
> >>>>>> writing load/store functions from scratch in Pig 0.7 at
> >>>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
> >>>>>>
> >>>>>> Pradeep
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
> >>>>>> Sent: Friday, May 21, 2010 11:33 AM
> >>>>>> To: pig-user@hadoop.apache.org
> >>>>>> Cc: Eli Collins
> >>>>>> Subject: Pig loader 0.6 to 0.7 migration guide
> >>>>>>
> >>>>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I
> might
> >>>>>> be remembering incorrectly) asked if there was a migration guide for
> >>>>>> moving Pig load and store functions from 0.6 to 0.7.  I said there
> >>>>>> was
> >>>>>> but I couldn't remember if it had been posted yet or not.  In fact
> it
> >>>>>> had already been posted to
> >>>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
> >>>>>> .  Also, you can find the list of all incompatible changes for 0.7
> at
> >>>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
> >>>>>> .  Sorry, I should have included those links in my original slides.
> >>>>>>
> >>>>>> Alan.
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
>
>

Re: Pig loader 0.6 to 0.7 migration guide

Posted by Scott Carey <sc...@richrelevance.com>.
I wish there were better documentation on that too.

Looking at the PigStorage code, it serializes an array of booleans via UDFContext to the back-end.

It would be significantly better if Pig serialized the requested fields for us, provided that pushProjection returned a code that indicated that the projection would be supported.

Forcing users to do that serialization themselves is bug-prone, especially in the presence of nested schemas.

The documentation is also poor when it comes to describing what the RequiredFieldList even is.

It has a name and an index field.   The code itself seems to allow for either of these to be filled.  What do they mean?

Suppose the schema returned by the loader is:
 (id: int, name: chararray, department: chararray)

The RequiredFieldList is [ ("department", 1) , ("id", 0) ]

What does that mean?
* The name is the field name requested, and the index is the location it should occupy in the result?  So return (id: int, department: chararray)?
* The index is the index in the source schema, and the name is for renaming, so return (department: chararray, id: int) (where the data in department is actually that from the original's name field)?
* The position in the RequiredFieldList array is the 'destination' requested, the name is optional (if the schema had one), and the index is the location in the original schema.  So the above RequiredFieldList is actually impossible, since "department" is always index 2.

I think it is the last one, but the first idea might be it too.  Either way, the javadoc and other documentation do not describe what these values mean or what their possible ranges might be.
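
For reference, this is roughly the shape of the pattern I'm describing (the property key and signature here are placeholders; see PigStorage itself for the real details):

import java.io.IOException;
import java.util.Properties;

import org.apache.pig.LoadPushDown.RequiredField;
import org.apache.pig.LoadPushDown.RequiredFieldList;
import org.apache.pig.impl.util.ObjectSerializer;
import org.apache.pig.impl.util.UDFContext;

// Sketch: turn the RequiredFieldList into a boolean[] on the front-end, ship it
// through UDFContext, and read the flags back on the back-end.
public class BooleanColumnsExample {
    private final String signature = "myLoader-1";   // placeholder UDF signature

    private Properties props() {
        return UDFContext.getUDFContext()
                .getUDFProperties(getClass(), new String[] { signature });
    }

    // front-end: remember which columns were requested
    public void save(RequiredFieldList requiredFieldList, int numColumns) throws IOException {
        boolean[] required = new boolean[numColumns];
        for (RequiredField f : requiredFieldList.getFields()) {
            required[f.getIndex()] = true;
        }
        props().setProperty("required.columns", ObjectSerializer.serialize(required));
    }

    // back-end: restore the flags inside the task
    public boolean[] load() throws IOException {
        return (boolean[]) ObjectSerializer.deserialize(props().getProperty("required.columns"));
    }
}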

On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:

> I'm trying to figure out how exactly to appropriately implement the
> LoadPushDown interface in my LoadFunc implementation. I need to take
> the list of column aliases and pass that from the
> LoadPushDown.pushProjection(RequiredFieldList) function to make it
> available in the getTuple function. I'm kind of new to this so forgive
> me if this is obvious. From my readings of the mailing list it appears
> that the pushProjection function is called in the front-end where as
> the getTuple function is called in the back-end. How does a LoanFunc
> pass information from the front to the back end instances?
> 
> regards, Andrew
> 
> On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <ga...@yahoo-inc.com> wrote:
>> A similar need is being expressed by zebra folks here - https://issues.apache.org/jira/browse/PIG-1337.
>> You might want to comment/vote on it as it is scheduled for 0.8 release.
>> 
>> Loading data in prepareToRead() is fine. For a workaround I think it should be ok to read the data directly from HDFS in each of the mappers provided you aren't doing any costly namespace operations like 'listStatus' that can stress the namesystem in the event of thousands of tasks executing it concurrently.
>> 
>> Regards
>> -@nkur
>> 
>>  6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>> 
>> 
>> 
>> On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:
>> 
>>> Scott,
>>>       You can set hadoop properties at the time of running your pig script with -D option. So
>>> pig -Dhadoop.property.name=something myscript essentially sets the property in the job configuration.
>>> 
>> 
>> So no programatic configuration of hadoop properties is allowed (where its easier to control) but its allowable to set it at the script level?  I guess I can do that, but it complicates things.
>> Also this is a very poor way to do this.  My script has 600 lines of Pig and ~45 M/R jobs.  Only three of the jobs need the distributed cache, not all 45.
>> 
>>> Speaking specifically of utilizing the distributed cache feature, you can just set the filename in LoadFunc constructor and then load the data in memory in getNext() method if not already loaded.
>>> 
>> 
>> That is what the original idea was.
>> 
>>> Here is the pig command to set up the distributed cache
>>> 
>>> pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name   ---> This name needs to be passed to UDF constructor so that its available in mapper/reducer's working dir on compute node.
>>>       -Dmapred.create.symlink=yes
>>>       script.pig
>> 
>> If that property is set, then constructor only needs file-name (the symlink) right?  Right now I'm trying to set those properties using the DistributedCache static interfaces which means I need to have access to the full path.
>> 
>>> 
>>> Implement something like a loadData() method that loads the data only once and invoke it from getNext() method. The script will work even in the local mode if the file distributed via distributed cache resides in the CWD from where script is invoked.
>>> 
>> 
>> I'm loading the data in prepareToRead(), which seems most appropriate.  Do you see any problem with that?
>> 
>>> Hope that's helpful.
>> 
>> I think the command line property hack is insufficient.  I am left with a choice of having a couple jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up distributed cache.  Job setup time is already 1/4 of my processing time.
>> Is there a feature request for Load/Store access to Hadoop job configuration properties?
>> 
>> Ideally, this would be a method on LoadFunc that passes a modifiable Configuration object in on the front-end, or a callback for a user to optionally provide a Configuration object with the few properties you want to alter in it that Pig can apply to the real thing before it configures its properties.
>> 
>> Thanks for the info Ankur,
>> 
>> -Scott
>> 
>>> 
>>> -@nkur
>>> 
>>> On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>>> 
>>> So, here are some things I'm struggling with now:
>>> 
>>> In a LoadFunc, If I want to load something into DistributedCache.  The path is passed into the LoadFunc constructor as an argument.
>>> Documentation on getSchema() and all other metadata methods state that you can't modify the job or its configuration passed in.  I've verified that changes to the Configuration are ignored if set here.
>>> 
>>> It appears that I could set these properties in setLocation() but that is called a lot on the back-end too, and the documentation does not state if setLocation() is called at all on the front-end.  Based on my experimental results, it doesn't seem to.
>>> Is there no way to modify Hadoop properties on the front-end to utilize hadoop features?  UDFContext seems completely useless for setting hadoop properties for things other than the UDF itself -- like distributed cache settings.  A stand-alone front-end hook for this would be great.  Otherwise, any hack that works would be acceptable for now.
>>> 
>>> 
>>> * The documentation for LoadMetadata can use some information about when each method gets called -- front end only?  Between what other calls?
>>> * UDFContext's documentation needs help too --
>>> ** addJobConf() is public, but not expected to be used by end-users, right?  Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses.
>>> 
>>> 
>>> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>>> 
>>>> Scott,
>>>> 
>>>> I made an effort to address the documentation in https://issues.apache.org/jira/browse/PIG-1370
>>>> If you have a chance take a look and let me know if it deals with
>>>> the issues you have or if more work is needed.
>>>> 
>>>> Alan.
>>>> 
>>>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>>> 
>>>>> I have been using these documents for a couple weeks, implementing
>>>>> various store and load functionality, and they have been very helpful.
>>>>> 
>>>>> However, there is room for improvement.  What is most unclear is
>>>>> when the API methods get called.  Each method should clearly state
>>>>> in these documents (and the javadoc) when it is called -- front-end
>>>>> only? back-end only?  both?  Sometimes this is obvious, other times
>>>>> it is not.
>>>>> For example, without looking at the source code its not possible to
>>>>> tell or infer if pushProjection() is called on the front-end or back-
>>>>> end, or both.  It could be implemented by being called on the front-
>>>>> end, expecting the loader implementation to persist necessary state
>>>>> to UDFContext for the back-end, or be called only on the back-end,
>>>>> or both.  One has to look at PigStorage source to see that it
>>>>> persists the pushProjection information into UDFContext, so its
>>>>> _probably_ only called on the front-end.
>>>>> 
>>>>> There are also a few types that these interfaces return or are
>>>>> provided that are completely undocumented.  I had to look at the
>>>>> source code to figure out what ResourceStatistics does, and how
>>>>> ResourceSchema should be used.  RequiredField, RequiredFieldList,
>>>>> and RequiredFieldResponse are all poorly documented aspects of a
>>>>> public interface.
>>>>> 
>>>>> 
>>>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>>> 
>>>>>> To add to this, there is also a how-to document on how to go about
>>>>>> writing load/store functions from scratch in Pig 0.7 at
>>>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>>> 
>>>>>> Pradeep
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>>>> To: pig-user@hadoop.apache.org
>>>>>> Cc: Eli Collins
>>>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>>> 
>>>>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I might
>>>>>> be remembering incorrectly) asked if there was a migration guide for
>>>>>> moving Pig load and store functions from 0.6 to 0.7.  I said there
>>>>>> was
>>>>>> but I couldn't remember if it had been posted yet or not.  In fact it
>>>>>> had already been posted to
>>>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>>>>> .  Also, you can find the list of all incompatible changes for 0.7 at
>>>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>>>>> .  Sorry, I should have included those links in my original slides.
>>>>>> 
>>>>>> Alan.
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 


Re: Pig loader 0.6 to 0.7 migration guide

Posted by Andrew Rothstein <an...@gmail.com>.
I'm trying to figure out how exactly to implement the LoadPushDown
interface in my LoadFunc implementation. I need to take the list of
column aliases passed to the LoadPushDown.pushProjection(RequiredFieldList)
function and make it available in the getTuple function. I'm kind of new
to this, so forgive me if this is obvious. From my reading of the mailing
list it appears that the pushProjection function is called on the
front-end, whereas the getTuple function is called on the back-end. How
does a LoadFunc pass information from the front-end to the back-end
instances?

regards, Andrew

On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <ga...@yahoo-inc.com> wrote:
> A similar need is being expressed by zebra folks here - https://issues.apache.org/jira/browse/PIG-1337.
> You might want to comment/vote on it as it is scheduled for 0.8 release.
>
> Loading data in prepareToRead() is fine. For a workaround I think it should be ok to read the data directly from HDFS in each of the mappers provided you aren't doing any costly namespace operations like 'listStatus' that can stress the namesystem in the event of thousands of tasks executing it concurrently.
>
> Regards
> -@nkur
>
>  6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>
>
>
> On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:
>
>> Scott,
>>       You can set hadoop properties at the time of running your pig script with -D option. So
>> pig -Dhadoop.property.name=something myscript essentially sets the property in the job configuration.
>>
>
> So no programatic configuration of hadoop properties is allowed (where its easier to control) but its allowable to set it at the script level?  I guess I can do that, but it complicates things.
> Also this is a very poor way to do this.  My script has 600 lines of Pig and ~45 M/R jobs.  Only three of the jobs need the distributed cache, not all 45.
>
>> Speaking specifically of utilizing the distributed cache feature, you can just set the filename in LoadFunc constructor and then load the data in memory in getNext() method if not already loaded.
>>
>
> That is what the original idea was.
>
>> Here is the pig command to set up the distributed cache
>>
>> pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name   ---> This name needs to be passed to UDF constructor so that its available in mapper/reducer's working dir on compute node.
>>       -Dmapred.create.symlink=yes
>>       script.pig
>
> If that property is set, then constructor only needs file-name (the symlink) right?  Right now I'm trying to set those properties using the DistributedCache static interfaces which means I need to have access to the full path.
>
>>
>> Implement something like a loadData() method that loads the data only once and invoke it from getNext() method. The script will work even in the local mode if the file distributed via distributed cache resides in the CWD from where script is invoked.
>>
>
> I'm loading the data in prepareToRead(), which seems most appropriate.  Do you see any problem with that?
>
>> Hope that's helpful.
>
> I think the command line property hack is insufficient.  I am left with a choice of having a couple jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up distributed cache.  Job setup time is already 1/4 of my processing time.
> Is there a feature request for Load/Store access to Hadoop job configuration properties?
>
> Ideally, this would be a method on LoadFunc that passes a modifiable Configuration object in on the front-end, or a callback for a user to optionally provide a Configuration object with the few properties you want to alter in it that Pig can apply to the real thing before it configures its properties.
>
> Thanks for the info Ankur,
>
> -Scott
>
>>
>> -@nkur
>>
>> On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>>
>> So, here are some things I'm struggling with now:
>>
>> In a LoadFunc, If I want to load something into DistributedCache.  The path is passed into the LoadFunc constructor as an argument.
>> Documentation on getSchema() and all other metadata methods state that you can't modify the job or its configuration passed in.  I've verified that changes to the Configuration are ignored if set here.
>>
>> It appears that I could set these properties in setLocation() but that is called a lot on the back-end too, and the documentation does not state if setLocation() is called at all on the front-end.  Based on my experimental results, it doesn't seem to.
>> Is there no way to modify Hadoop properties on the front-end to utilize hadoop features?  UDFContext seems completely useless for setting hadoop properties for things other than the UDF itself -- like distributed cache settings.  A stand-alone front-end hook for this would be great.  Otherwise, any hack that works would be acceptable for now.
>>
>>
>> * The documentation for LoadMetadata can use some information about when each method gets called -- front end only?  Between what other calls?
>> * UDFContext's documentation needs help too --
>> ** addJobConf() is public, but not expected to be used by end-users, right?  Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses.
>>
>>
>> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>>
>>> Scott,
>>>
>>> I made an effort to address the documentation in https://issues.apache.org/jira/browse/PIG-1370
>>> If you have a chance take a look and let me know if it deals with
>>> the issues you have or if more work is needed.
>>>
>>> Alan.
>>>
>>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>>
>>>> I have been using these documents for a couple weeks, implementing
>>>> various store and load functionality, and they have been very helpful.
>>>>
>>>> However, there is room for improvement.  What is most unclear is
>>>> when the API methods get called.  Each method should clearly state
>>>> in these documents (and the javadoc) when it is called -- front-end
>>>> only? back-end only?  both?  Sometimes this is obvious, other times
>>>> it is not.
>>>> For example, without looking at the source code its not possible to
>>>> tell or infer if pushProjection() is called on the front-end or back-
>>>> end, or both.  It could be implemented by being called on the front-
>>>> end, expecting the loader implementation to persist necessary state
>>>> to UDFContext for the back-end, or be called only on the back-end,
>>>> or both.  One has to look at PigStorage source to see that it
>>>> persists the pushProjection information into UDFContext, so its
>>>> _probably_ only called on the front-end.
>>>>
>>>> There are also a few types that these interfaces return or are
>>>> provided that are completely undocumented.  I had to look at the
>>>> source code to figure out what ResourceStatistics does, and how
>>>> ResourceSchema should be used.  RequiredField, RequiredFieldList,
>>>> and RequiredFieldResponse are all poorly documented aspects of a
>>>> public interface.
>>>>
>>>>
>>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>>
>>>>> To add to this, there is also a how-to document on how to go about
>>>>> writing load/store functions from scratch in Pig 0.7 at
>>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>>
>>>>> Pradeep
>>>>>
>>>>> -----Original Message-----
>>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>>> To: pig-user@hadoop.apache.org
>>>>> Cc: Eli Collins
>>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>>
>>>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I might
>>>>> be remembering incorrectly) asked if there was a migration guide for
>>>>> moving Pig load and store functions from 0.6 to 0.7.  I said there
>>>>> was
>>>>> but I couldn't remember if it had been posted yet or not.  In fact it
>>>>> had already been posted to
>>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>>>> .  Also, you can find the list of all incompatible changes for 0.7 at
>>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>>>> .  Sorry, I should have included those links in my original slides.
>>>>>
>>>>> Alan.
>>>>
>>>
>>
>>
>
>
>

Re: Pig loader 0.6 to 0.7 migration guide

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
A similar need is being expressed by zebra folks here - https://issues.apache.org/jira/browse/PIG-1337.
You might want to comment/vote on it as it is scheduled for 0.8 release.

Loading data in prepareToRead() is fine. As a workaround, I think it should be OK to read the data directly from HDFS in each of the mappers, provided you aren't doing any costly namespace operations like 'listStatus' that can stress the namesystem when thousands of tasks execute them concurrently.
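
Something along these lines (only a sketch of the idea; the side-file path and the parsing helper are placeholders, not tested code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.impl.util.UDFContext;

// Read the side file straight from HDFS once per task, inside prepareToRead(),
// with a single open() and no listStatus()/globbing.
public class DirectHdfsSideFileLoader extends PigStorage {
    private final String sideFilePath;        // full HDFS path, from the constructor
    private Map<String, String> sideData;     // loaded once per task

    public DirectHdfsSideFileLoader(String sideFilePath) {
        this.sideFilePath = sideFilePath;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        super.prepareToRead(reader, split);
        if (sideData == null) {
            Configuration conf = UDFContext.getUDFContext().getJobConf();
            Path path = new Path(sideFilePath);
            FileSystem fs = path.getFileSystem(conf);
            FSDataInputStream in = fs.open(path);      // one namespace operation
            try {
                sideData = parseSideData(in);          // hypothetical parsing helper
            } finally {
                in.close();
            }
        }
    }

    private Map<String, String> parseSideData(FSDataInputStream in) throws IOException {
        return new HashMap<String, String>();          // placeholder
    }
}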

Regards
-@nkur

On 6/2/10 10:36 PM, "Scott Carey" <sc...@richrelevance.com> wrote:



On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

> Scott,
>       You can set hadoop properties at the time of running your pig script with -D option. So
> pig -Dhadoop.property.name=something myscript essentially sets the property in the job configuration.
>

So no programmatic configuration of Hadoop properties is allowed (where it's easier to control), but it's allowable to set it at the script level?  I guess I can do that, but it complicates things.
Also this is a very poor way to do this.  My script has 600 lines of Pig and ~45 M/R jobs.  Only three of the jobs need the distributed cache, not all 45.

> Speaking specifically of utilizing the distributed cache feature, you can just set the filename in LoadFunc constructor and then load the data in memory in getNext() method if not already loaded.
>

That is what the original idea was.

> Here is the pig command to set up the distributed cache
>
> pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name   ---> This name needs to be passed to UDF constructor so that its available in mapper/reducer's working dir on compute node.
>       -Dmapred.create.symlink=yes
>       script.pig

If that property is set, then the constructor only needs the file-name (the symlink), right?  Right now I'm trying to set those properties using the DistributedCache static interfaces, which means I need to have access to the full path.

>
> Implement something like a loadData() method that loads the data only once and invoke it from getNext() method. The script will work even in the local mode if the file distributed via distributed cache resides in the CWD from where script is invoked.
>

I'm loading the data in prepareToRead(), which seems most appropriate.  Do you see any problem with that?

> Hope that's helpful.

I think the command-line property hack is insufficient.  I am left with a choice of having a couple of jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up the distributed cache.  Job setup time is already 1/4 of my processing time.
Is there a feature request for Load/Store access to Hadoop job configuration properties?

Ideally, this would be a method on LoadFunc that passes in a modifiable Configuration object on the front-end, or a callback where the user can optionally provide a Configuration object containing the few properties to alter, which Pig would apply to the real configuration before finalizing its properties.

Thanks for the info Ankur,

-Scott

>
> -@nkur
>
> On 6/2/10 2:53 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>
> So, here are some things I'm struggling with now:
>
> In a LoadFunc, If I want to load something into DistributedCache.  The path is passed into the LoadFunc constructor as an argument.
> Documentation on getSchema() and all other metadata methods state that you can't modify the job or its configuration passed in.  I've verified that changes to the Configuration are ignored if set here.
>
> It appears that I could set these properties in setLocation() but that is called a lot on the back-end too, and the documentation does not state if setLocation() is called at all on the front-end.  Based on my experimental results, it doesn't seem to.
> Is there no way to modify Hadoop properties on the front-end to utilize hadoop features?  UDFContext seems completely useless for setting hadoop properties for things other than the UDF itself -- like distributed cache settings.  A stand-alone front-end hook for this would be great.  Otherwise, any hack that works would be acceptable for now.
>
>
> * The documentation for LoadMetadata can use some information about when each method gets called -- front end only?  Between what other calls?
> * UDFContext's documentation needs help too --
> ** addJobConf() is public, but not expected to be used by end-users, right?  Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses.
>
>
> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>
>> Scott,
>>
>> I made an effort to address the documentation in https://issues.apache.org/jira/browse/PIG-1370
>> If you have a chance take a look and let me know if it deals with
>> the issues you have or if more work is needed.
>>
>> Alan.
>>
>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>
>>> I have been using these documents for a couple weeks, implementing
>>> various store and load functionality, and they have been very helpful.
>>>
>>> However, there is room for improvement.  What is most unclear is
>>> when the API methods get called.  Each method should clearly state
>>> in these documents (and the javadoc) when it is called -- front-end
>>> only? back-end only?  both?  Sometimes this is obvious, other times
>>> it is not.
>>> For example, without looking at the source code its not possible to
>>> tell or infer if pushProjection() is called on the front-end or back-
>>> end, or both.  It could be implemented by being called on the front-
>>> end, expecting the loader implementation to persist necessary state
>>> to UDFContext for the back-end, or be called only on the back-end,
>>> or both.  One has to look at PigStorage source to see that it
>>> persists the pushProjection information into UDFContext, so its
>>> _probably_ only called on the front-end.
>>>
>>> There are also a few types that these interfaces return or are
>>> provided that are completely undocumented.  I had to look at the
>>> source code to figure out what ResourceStatistics does, and how
>>> ResourceSchema should be used.  RequiredField, RequiredFieldList,
>>> and RequiredFieldResponse are all poorly documented aspects of a
>>> public interface.
>>>
>>>
>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>
>>>> To add to this, there is also a how-to document on how to go about
>>>> writing load/store functions from scratch in Pig 0.7 at
>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>
>>>> Pradeep
>>>>
>>>> -----Original Message-----
>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>> To: pig-user@hadoop.apache.org
>>>> Cc: Eli Collins
>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>
>>>> At the Bay Area HUG on Wednesday someone (Eli I think, though I might
>>>> be remembering incorrectly) asked if there was a migration guide for
>>>> moving Pig load and store functions from 0.6 to 0.7.  I said there
>>>> was
>>>> but I couldn't remember if it had been posted yet or not.  In fact it
>>>> had already been posted to
>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>>> .  Also, you can find the list of all incompatible changes for 0.7 at
>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>>> .  Sorry, I should have included those links in my original slides.
>>>>
>>>> Alan.
>>>
>>
>
>



Re: Pig loader 0.6 to 0.7 migration guide

Posted by Scott Carey <sc...@richrelevance.com>.
On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

> Scott,
>       You can set hadoop properties at the time of running your pig script with -D option. So
> pig -Dhadoop.property.name=something myscript essentially sets the property in the job configuration.
> 

So no programmatic configuration of Hadoop properties is allowed (where it's easier to control), but it's allowable to set it at the script level?  I guess I can do that, but it complicates things.
Also this is a very poor way to do this.  My script has 600 lines of Pig and ~45 M/R jobs.  Only three of the jobs need the distributed cache, not all 45.

> Speaking specifically of utilizing the distributed cache feature, you can just set the filename in LoadFunc constructor and then load the data in memory in getNext() method if not already loaded.
> 

That is what the original idea was.

> Here is the pig command to set up the distributed cache
> 
> pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name   ---> This name needs to be passed to UDF constructor so that its available in mapper/reducer's working dir on compute node.
>       -Dmapred.create.symlink=yes
>       script.pig

If that property is set, then the constructor only needs the file-name (the symlink), right?  Right now I'm trying to set those properties using the DistributedCache static interfaces, which means I need to have access to the full path.
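
To make sure I understand the symlink piece, a small illustrative sketch (the file name is a placeholder): with mapred.create.symlink=yes, the cached file shows up in the task's working directory as a symlink named after the '#fragment', so it can be read as a plain local file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical reader for the symlinked cache file; the constructor needs only the
// short '#fragment' name, not the full HDFS path.
public class CacheSymlinkReader {
    private final String cacheFileName;   // e.g. "file-name" from the #fragment

    public CacheSymlinkReader(String cacheFileName) {
        this.cacheFileName = cacheFileName;
    }

    public String readFirstLine() throws IOException {
        // relative path, resolved against the task's current working directory
        BufferedReader r = new BufferedReader(new FileReader(cacheFileName));
        try {
            return r.readLine();
        } finally {
            r.close();
        }
    }
}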

> 
> Implement something like a loadData() method that loads the data only once and invoke it from getNext() method. The script will work even in the local mode if the file distributed via distributed cache resides in the CWD from where script is invoked.
> 

I'm loading the data in prepareToRead(), which seems most appropriate.  Do you see any problem with that?

> Hope that's helpful.

I think the command-line property hack is insufficient.  I am left with a choice of having a couple of jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up the distributed cache.  Job setup time is already 1/4 of my processing time.
Is there a feature request for Load/Store access to Hadoop job configuration properties?

Ideally, this would be a method on LoadFunc that passes in a modifiable Configuration object on the front-end, or a callback where the user can optionally provide a Configuration object containing the few properties to alter, which Pig would apply to the real configuration before finalizing its properties.
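
To make the shape of that proposal concrete -- this is NOT an existing Pig API, just an illustration of the hook being asked for:

import org.apache.hadoop.conf.Configuration;

// Hypothetical front-end-only callback: properties set on 'conf' here would be
// merged by Pig into the real job configuration before it is finalized.
public interface FrontEndJobConfigurable {
    void addFrontEndProperties(Configuration conf);
}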

Thanks for the info Ankur,

-Scott



Re: Pig loader 0.6 to 0.7 migration guide

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
Scott,
       You can set Hadoop properties at the time of running your Pig script with the -D option.  So
pig -Dhadoop.property.name=something myscript essentially sets the property in the job configuration.

Speaking specifically of the distributed cache feature, you can just pass the filename to the LoadFunc constructor and then load the data into memory in the getNext() method if it is not already loaded.

Here is the pig command to set up the distributed cache:

pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
    -Dmapred.create.symlink=yes \
    script.pig

The part after the '#' (file-name) is the symlink name.  It needs to be passed to the UDF constructor so that the file is available under that name in the mapper/reducer's working directory on the compute node.

Implement something like a loadData() method that loads the data only once, and invoke it from the getNext() method -- roughly as in the sketch below.  The script will work even in local mode if the file distributed via the distributed cache resides in the CWD from which the script is invoked.
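
A minimal illustration of that pattern -- untested, and the class, field, and file names are just for illustration; it assumes the symlink name is passed to the constructor and extends PigStorage only for brevity:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class SideFileLoader extends PigStorage {
    private final String cacheFileName;   // symlink name, e.g. "filter-keys"
    private Set<String> keys;             // side data, loaded lazily, once per task

    public SideFileLoader(String cacheFileName) {
        this.cacheFileName = cacheFileName;
    }

    // Loads the side file the first time it is called; a no-op afterwards.
    private void loadData() throws IOException {
        if (keys != null) {
            return;
        }
        keys = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(cacheFileName));
        String line;
        while ((line = in.readLine()) != null) {
            keys.add(line.trim());
        }
        in.close();
    }

    @Override
    public Tuple getNext() throws IOException {
        loadData();                        // cheap after the first record
        Tuple t = super.getNext();
        // ... the loaded keys would be used here, e.g. to filter or annotate t
        return t;
    }
}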

Hope that's helpful.

-@nkur
