Posted to user@pig.apache.org by zaki rahaman <za...@gmail.com> on 2009/08/28 20:39:14 UTC

UDFs and Amazon Elastic MapReduce

Hi all,

I had a question about running Pig jobs on Amazon's cloud services.
Specifically, how do you go about adding UDF jar files, and what modifications,
if any, do you need to make to a script so that it runs effectively via
MapReduce (do you need to ship/cache the UDF jar, and if so, how?)

Thanks for all the help so far,


-- 
Zaki Rahaman

RE: UDFs and Amazon Elastic MapReduce

Posted by zjffdu <zj...@gmail.com>.
Hi zaki,

You only need to REGISTER the UDF jar, and Pig will distribute the jar to the
cluster for you.

Each time you submit the Pig script, Pig will ship the UDF jar to the cluster
again.
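
For example, a minimal sketch; the jar path and the UDF class com.example.UPPER
below are hypothetical, not from this thread:

-- REGISTER is the only change needed; Pig ships the registered jar to the
-- cluster with the job, so no extra DEFINE ... SHIP/CACHE clause is required
REGISTER /path/to/myudfs.jar;

raw   = LOAD 'input.txt' AS (line:chararray);
upper = FOREACH raw GENERATE com.example.UPPER(line);
STORE upper INTO 'output';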


Best regards,
Jeff zhang


-----Original Message-----
From: zaki rahaman [mailto:zaki.rahaman@gmail.com] 
Sent: September 2, 2009 7:49
To: pig-user@hadoop.apache.org
Subject: Re: UDFs and Amazon Elastic MapReduce

Apologies for re-posting, but I never got an answer to my question.
Basically, when using UDF jar files, how do you go about ensuring that the
jar file is replicated to all nodes in the cluster, and that each node uses
its own local copy of the jar rather than the 'master' copy (to avoid
unnecessary network traffic and bandwidth issues)? It looks like this is
accomplished via a DEFINE + ship/cache statement, but I'm not sure which one
is necessary to use.

On Fri, Aug 28, 2009 at 2:39 PM, zaki rahaman <za...@gmail.com> wrote:

> Hi all,
>
> I had a question about running Pig jobs on Amazon's cloud services.
> Specifically, how do you go about adding UDF jar files, and what modifications,
> if any, do you need to make to a script so that it runs effectively via
> MapReduce (do you need to ship/cache the UDF jar, and if so, how?)
>
> Thanks for all the help so far,
>
>
> --
> Zaki Rahaman
>
>


-- 
Zaki Rahaman


Re: UDFs and Amazon Elastic MapReduce

Posted by "Khanna, Richendra" <ri...@amazon.com>.
Hi Zaki,

As part of the enhancements to Pig for it to work well with Amazon Elastic
MapReduce, one of the changes made was to allow the argument passed to
"REGISTER" to come from a remote file system. So for instance you can do:
 
REGISTER s3://my-bucket/path/to/my/uploaded.jar;
 
This jar is downloaded to the master by the Grunt shell script at
interpretation time. It is then uploaded to the distributed cache by the
Grunt shell as part of running the job. Thus there is nothing in particular
you need in your jars/script files to ensure they are used in a scalable
fashion.
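
As a sketch, an end-to-end script run on Elastic MapReduce needs nothing more
than the remote REGISTER; the bucket, paths, and UDF class com.example.Normalize
below are hypothetical:

-- the jar is fetched from S3 by the Grunt shell at interpretation time and
-- placed in the distributed cache when the job runs, as described above
REGISTER s3://my-bucket/path/to/my/uploaded.jar;

logs  = LOAD 's3://my-bucket/input/' AS (line:chararray);
clean = FOREACH logs GENERATE com.example.Normalize(line);
STORE clean INTO 's3://my-bucket/output/';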

Also for questions related to our service, you might get a faster response
on our forums 
(http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52&start=0),
since those are actively watched by our team.

Thanks,
Richendra


On 9/2/09 7:48 AM, "zaki rahaman" <za...@gmail.com> wrote:

> Apologies for re-posting, but I never got an answer to my question.
> Basically, when using UDF jar files, how do you go about ensuring that the
> jar file is replicated to all nodes in the cluster, and that each node uses
> its own local copy of the jar rather than the 'master' copy (to avoid
> unnecessary network traffic and bandwidth issues)? It looks like this is
> accomplished via a DEFINE + ship/cache statement, but I'm not sure which one
> is necessary to use.
> 
> On Fri, Aug 28, 2009 at 2:39 PM, zaki rahaman <za...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> I had a question about running Pig jobs on Amazon's cloud services.
>> Specifically, how do you go about adding UDF jar files, and what modifications,
>> if any, do you need to make to a script so that it runs effectively via
>> MapReduce (do you need to ship/cache the UDF jar, and if so, how?)
>> 
>> Thanks for all the help so far,
>> 
>> 
>> --
>> Zaki Rahaman
>> 
>> 
> 
> 
> --
> Zaki Rahaman
> 


Re: UDFs and Amazon Elastic MapReduce

Posted by zaki rahaman <za...@gmail.com>.
Apologies for re-posting, but I never got an answer to my question.
Basically, when using UDF jar files, how do you go about ensuring that the
jar file is replicated to all nodes in the cluster, and that each node uses
its own local copy of the jar rather than the 'master' copy (to avoid
unnecessary network traffic and bandwidth issues)? It looks like this is
accomplished via a DEFINE + ship/cache statement, but I'm not sure which one
is necessary to use.

On Fri, Aug 28, 2009 at 2:39 PM, zaki rahaman <za...@gmail.com> wrote:

> Hi all,
>
> I had a question about running Pig jobs on Amazon's cloud services.
> Specifically, how do you go about adding UDF jar files, and what modifications,
> if any, do you need to make to a script so that it runs effectively via
> MapReduce (do you need to ship/cache the UDF jar, and if so, how?)
>
> Thanks for all the help so far,
>
>
> --
> Zaki Rahaman
>
>


-- 
Zaki Rahaman