Posted to dev@pig.apache.org by "Scott Carey (JIRA)" <ji...@apache.org> on 2010/06/05 07:04:28 UTC

[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

    [ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875858#action_12875858 ] 

Scott Carey commented on PIG-1337:
----------------------------------

Why not just give a loader (or storer) the ability to set things on a conf object directly?  DistributedCache won't be the only thing that I'll want access to.  I don't think Pig will want to add new functions every time a Hadoop feature comes along that someone wants access to.

Right now, users can set anything they want with properties on the script command line, but have zero ability to set anything in compiled code!  This seems backwards to me.  A custom LoadFunc or StoreFunc should either have access to the configuration that gets serialized for the job, or be able to return a Configuration object with the settings it wants Pig to pass on (Pig can then ignore or overwrite things that a user should never touch, similar to what happens with command line params).

Perhaps either a:

void configure(Configuration config);

method or

Configuration getCustomConfiguration();

method would be great.  The method names for the loader and storer may have to differ so as not to collide in classes that implement both, and the two should not share one method, since disambiguation would be a problem (a load and a store may not both want the distributed cache, for example).
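
A rough sketch of the second option, to make the idea concrete. Everything here is hypothetical: the interface names, the method names getCustomLoadConfiguration / getCustomStoreConfiguration, the merge helper, and the list of protected keys are invented for illustration and are not part of Pig. A plain Map<String, String> stands in for Hadoop's Configuration so the example compiles on its own, and the property keys are only examples.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Pig. A Map<String, String>
// stands in for org.apache.hadoop.conf.Configuration.
public class LoaderConfSketch {

    /** Hypothetical hook for loaders. */
    interface LoadConfigurable {
        Map<String, String> getCustomLoadConfiguration();
    }

    /** Hypothetical hook for storers; named differently so one class can implement both. */
    interface StoreConfigurable {
        Map<String, String> getCustomStoreConfiguration();
    }

    // Keys the framework refuses to let user code override (illustrative choice only).
    static final Set<String> PROTECTED_KEYS = Collections.singleton("mapred.job.tracker");

    /** Returns a copy of jobConf with the requested settings applied, skipping protected keys. */
    static Map<String, String> merge(Map<String, String> jobConf, Map<String, String> requested) {
        Map<String, String> merged = new HashMap<>(jobConf);
        for (Map.Entry<String, String> e : requested.entrySet()) {
            if (!PROTECTED_KEYS.contains(e.getKey())) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    /** A class that both loads and stores, asking for different settings in each role. */
    static class TableLoaderStorer implements LoadConfigurable, StoreConfigurable {
        @Override
        public Map<String, String> getCustomLoadConfiguration() {
            Map<String, String> m = new HashMap<>();
            m.put("mapred.cache.files", "hdfs:///example/index#index"); // example value
            m.put("mapred.job.tracker", "ignored:9001");                // will be dropped by merge
            return m;
        }

        @Override
        public Map<String, String> getCustomStoreConfiguration() {
            return new HashMap<>(); // the store side wants nothing extra
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("mapred.job.tracker", "real:9001");

        TableLoaderStorer func = new TableLoaderStorer();
        Map<String, String> forLoad = merge(jobConf, func.getCustomLoadConfiguration());
        Map<String, String> forStore = merge(jobConf, func.getCustomStoreConfiguration());

        System.out.println("load keys:  " + forLoad.keySet());
        System.out.println("store keys: " + forStore.keySet());
    }
}
```

With separate method names, the load side can request distributed cache settings while the store side of the same class requests nothing, and the framework still decides which keys it is willing to accept.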

> Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
> --------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1337
>                 URL: https://issues.apache.org/jira/browse/PIG-1337
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Chao Wang
>             Fix For: 0.8.0
>
>
> The Zebra storage layer needs to use distributed cache to reduce name node load during job runs.
> To do this, Zebra needs to set up distributed cache related configuration information in TableLoader (which extends Pig's LoadFunc).
> It is doing this within getSchema(conf). The problem is that the conf object here is not the one that is being serialized to the map/reduce backend. As such, the distributed cache is not set up properly.
> To work around this problem, we need Pig's LoadFunc to provide a way to set up distributed cache information in a conf object, and this conf object must be the one used by the map/reduce backend.
