Posted to user@pig.apache.org by Dexin Wang <wa...@gmail.com> on 2011/06/13 20:54:52 UTC

running pig on amazon ec2

Hi,

This is probably not directly a Pig question.

Anyone running Pig on Amazon EC2 instances? Something's not making sense to
me. I ran a Pig script that has about 10 mapred jobs in it on a 16-node
cluster of m1.small instances. It took *13 minutes*. The job reads input
from S3 and writes output to S3, but from the logs, the reading from and
writing to S3 is pretty fast, and all the intermediate steps should happen
on HDFS.

Running the same job on my MacBook Pro laptop, it only took *3 minutes*.

Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop; I'll try Pig
0.6 on my laptop. Some Hadoop config is probably also not ideal. I tried
m1.large instead of m1.small, but it doesn't seem to make a huge difference.
Anything you would suggest looking at to track down the slowness on EC2?

Dexin

Re: running pig on amazon ec2

Posted by Dexin Wang <wa...@gmail.com>.
Thanks a lot for the good advice.

I'll see if I can get LZO set up. Currently I'm using EMR, which uses Pig 0.6.
I'll look into Whirr to start the Hadoop cluster on EC2.

There is one place in my job where I can use a replicated join; I'm sure that
will cut down some time.

What I find interesting is that without doing any optimization on the
configuration or code side, I get a 2x to 4x speedup just by using the
"*Cluster Compute Quadruple Extra Large Instance*" (cc1.4xlarge) as opposed
to the regular "Large instance" (m1.large), dollar for dollar. They do claim
cc1.4xlarge's I/O is "very high". Since I suspect my job was spending most
of its time reading/writing disk, this speedup makes sense.


Re: running pig on amazon ec2

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You need to add this to your pig.properties:

pig.tmpfilecompression=true
pig.tmpfilecompression.codec=lzo

Make sure that you are running Hadoop 0.20.2 or higher, Pig 0.8.1 or
higher, and that all the LZO stuff is set up -- it's a bit involved.

Use replicated joins where possible.
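
For example, a sketch (relation and bucket names are made up; the small
relation must be listed last and is loaded into memory on every map task,
so the join happens map-side with no reduce phase):

    -- hypothetical relations; 'lookup' must fit in memory
    events = LOAD 's3://mybucket/events' AS (user_id:chararray, val:double);
    lookup = LOAD 's3://mybucket/user_dims' AS (user_id:chararray, segment:chararray);
    joined = JOIN events BY user_id, lookup BY user_id USING 'replicated';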

If you are doing a large number of small jobs, scheduling and
provisioning are likely to dominate -- tune your job scheduler to
schedule more tasks per heartbeat, and make sure your jar is as small
as you can get it (there's a lot of unjarring going on in Hadoop).
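
With the fair scheduler, for instance, the per-heartbeat knob is something
like this in mapred-site.xml (property name from the fair scheduler's docs;
it only applies if you actually run that scheduler):

    <!-- let the FairScheduler assign more than one task per heartbeat -->
    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>true</value>
    </property>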
D


Re: running pig on amazon ec2

Posted by Dexin Wang <wa...@gmail.com>.
Tomas,

What worked well for me is still to be figured out. Right now, it works but
it's too slow. I think one of the main problems is that my job has many
JOIN/GROUP BY steps, so lots of intermediate steps end up writing to disk,
which is slow.

On that note, does anyone know how to tell if LZO is turned on for
intermediate jobs? See this

http://pig.apache.org/docs/r0.8.0/cookbook.html#Compress+the+Results+of+Intermediate+Jobs

and this

http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

I see I have this in my mapred-site.xml file:

    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
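
For reference, the docs I've seen also mention a companion on/off switch;
I'm not sure it's set in my config, so treat this as my guess at what the
full pair looks like:

    <!-- assumption: the boolean switch plus the codec choice -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>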

Is that all I need to have map compression turned on? Thanks.

Dexin


Re: running pig on amazon ec2

Posted by Tomas Svarovsky <sv...@gmail.com>.
Hi Dexin,

Since I am a Pig and MapReduce newbie, your post is very
intriguing to me. I am coming from a Talend background and trying to
assess whether map/reduce would bring any speedup and faster
turnaround to my projects. My worry is that my data is too small, so
the MapReduce overhead will be prohibitive in certain cases.

When using Talend, if the transformation was reasonable, it could
process tens of thousands of rows per second. Processing 1 million rows
could be finished well under 1 minute, so I think that your dataset is
fairly small. Nevertheless, my data is growing, so soon it will be time
for Pig.

Could you provide some info on what worked well for you to run your job on EC2?

Thanks in advance,

Tomas


Re: running pig on amazon ec2

Posted by Daniel Dai <ji...@yahoo-inc.com>.
If the job finishes in 3 minutes in local mode, I would think it is small.

On 06/14/2011 11:07 AM, Dexin Wang wrote:
> Good to know. Trying single node hadoop cluster now. The main input is 
> about 1+ million lines of events. After some aggregation, it joins 
> with another input source which has also about 1+ million rows. Is 
> this considered small query? Thanks.
>
> On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai <jianyong@yahoo-inc.com 
> <ma...@yahoo-inc.com>> wrote:
>
>     Local mode and mapreduce mode makes a huge difference. For a small
>     query, the mapreduce overhead will dominate. For a fair
>     comparison, can you setup a single node hadoop cluster on your
>     laptop and run Pig on it?
>
>     Daniel
>
>
>     On 06/14/2011 10:54 AM, Dexin Wang wrote:
>>     Thanks for your feedback. My comments below.
>>
>>     On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai
>>     <jianyong@yahoo-inc.com <ma...@yahoo-inc.com>> wrote:
>>
>>         Curious, couple of questions:
>>         1. Are you running in local mode or mapreduce mode?
>>
>>     Local mode (-x local) when I ran it on my laptop, and mapreduce
>>     mode when I ran it on ec2 cluster.
>>
>>         2. If mapreduce mode, did you look into the hadoop log to see
>>         how much slow down each mapreduce job does?
>>
>>     I'm looking into that.
>>
>>         3. What kind of query is it?
>>
>>     The input is gzipped json files which has one event per line.
>>     Then I do some hourly aggregation on the raw events, then do
>>     bunch of groupping, joining and some metrics computing (like
>>     median, variance) on some fields.
>>
>>         Daniel
>>
>>      Someone mentioned it's EC2's I/O performance. But I'm sure there
>>     are plenty of people using EC2/EMR running big MR jobs so more
>>     likely I have some configuration issues? My jobs can be optimized
>>     a bit but the fact that running on my laptop is faster tells me
>>     this is a separate issue.
>>
>>     Thanks!
>>
>>
>>
>>         On 06/13/2011 11:54 AM, Dexin Wang wrote:
>>
>>             Hi,
>>
>>             This is probably not directly a Pig question.
>>
>>             Anyone running Pig on amazon EC2 instances? Something's
>>             not making sense to
>>             me. I ran a Pig script that has about 10 mapred jobs in
>>             it on a 16 node
>>             cluster using m1.small. It took *13 minutes*. The job
>>             reads input from S3
>>             and writes output to S3. But from the logs the reading
>>             and writing part
>>             to/from S3 is pretty fast. And all the intermediate steps
>>             should happen on
>>             HDFS.
>>
>>             Running the same job on my mbp laptop, it only took *3
>>             minutes*.
>>
>>             Amazon is using pig0.6 while I'm using pig 0.8 on laptop.
>>             I'll try Pig 0.6
>>             on my laptop. Some hadoop config is probably also not
>>             ideal. I tried
>>             m1.large instead of m1.small, doesn't seem to make a huge
>>             difference.
>>             Anything you would suggest to look for the slowness on EC2?
>>
>>             Dexin
>>
>>
>>
>
>


Re: running pig on amazon ec2

Posted by Dexin Wang <wa...@gmail.com>.
Good to know. Trying a single-node Hadoop cluster now. The main input is
1+ million lines of events. After some aggregation, it joins with another
input source which also has 1+ million rows. Is this considered a small
query? Thanks.


Re: running pig on amazon ec2

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Local mode and mapreduce mode make a huge difference. For a small
query, the mapreduce overhead will dominate. For a fair comparison, can
you set up a single-node Hadoop cluster on your laptop and run Pig on it?
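
For a like-for-like run, the same script can be launched in both modes
(script name is a placeholder):

    pig -x local myscript.pig        # local mode, no cluster needed
    pig -x mapreduce myscript.pig    # mapreduce mode against the configured cluster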

Daniel


Re: running pig on amazon ec2

Posted by Dexin Wang <wa...@gmail.com>.
Thanks for your feedback. My comments below.

On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai <ji...@yahoo-inc.com> wrote:

> Curious -- a couple of questions:
> 1. Are you running in local mode or mapreduce mode?
>
Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I
ran it on the EC2 cluster.

> 2. If mapreduce mode, did you look into the hadoop logs to see how much
> slowdown each mapreduce job introduces?
>
I'm looking into that.


> 3. What kind of query is it?
>
The input is gzipped JSON files which have one event per line. Then I do
some hourly aggregation on the raw events, then do a bunch of grouping,
joining, and some metrics computation (like median, variance) on some fields.
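
In Pig Latin terms, the script is roughly this shape (aliases and fields
are illustrative, not the real script):

    raw    = LOAD 's3://mybucket/events' AS (user:chararray, hour:int, val:double);
    hourly = GROUP raw BY (user, hour);
    agg    = FOREACH hourly GENERATE FLATTEN(group) AS (user, hour),
                 COUNT(raw) AS n, AVG(raw.val) AS avg_val;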

> Daniel
Someone mentioned it's EC2's I/O performance. But I'm sure there are
plenty of people using EC2/EMR to run big MR jobs, so more likely I have
some configuration issue? My jobs can be optimized a bit, but the fact that
running on my laptop is faster tells me this is a separate issue.

Thanks!

Re: running pig on amazon ec2

Posted by Daniel Dai <ji...@yahoo-inc.com>.
Curious -- a couple of questions:
1. Are you running in local mode or mapreduce mode?
2. If mapreduce mode, did you look into the hadoop logs to see how much
slowdown each mapreduce job introduces?
3. What kind of query is it?

Daniel
