Posted to common-user@hadoop.apache.org by bharath vissapragada <bh...@gmail.com> on 2009/08/25 05:00:06 UTC

cost model for MR programs

Hi all,

Is there any general cost model that can be used to estimate the run time of a
program (similar to page I/Os and selectivity factors in an RDBMS) in terms of
configuration aspects such as the number of nodes, page I/Os, etc.?

Re: cost model for MR programs

Posted by bharath vissapragada <bh...@gmail.com>.
@Sanjay

Thanks for your reply. I was thinking along the same lines.

@Jeff

I found those links very useful; I am working on similar things. Thanks for the
reply! :)




Re: cost model for MR programs

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Bharath,
There has been some work in the research community on predicting the runtime
of Hadoop MapReduce, Pig, and Hive jobs, though the approaches are not
always cost-based. At the University of Washington, they've worked on
progress indicators for Pig; see
ftp://ftp.cs.washington.edu/tr/2009/07/UW-CSE-09-07-01.PDF. At the
University of California, Berkeley, they've worked on predicting multiple
metrics about query progress, first in NeoView, and subsequently in Hadoop;
see http://www.cs.berkeley.edu/~archanag/publications/ICDE09.pdf for the
NeoView results.

Some preliminary design work has been done in the Hive project for
collecting statistics on Hive tables. See
https://issues.apache.org/jira/browse/HIVE-33. Any contributions to this
work would be much appreciated!

Regards,
Jeff


Re: cost model for MR programs

Posted by indoos <in...@gmail.com>.
Hi,
My suggestion would be that we should not compel ourselves to compare
databases with Hadoop.
That said, here is something that is probably not even close to what you need,
but it might be helpful:
1. Number of nodes - these are the parameters to look at (a rough sketch of
the arithmetic appears after this list):
- the average time taken by a single map and reduce task (available from the
job history analytics),
- input file size vs. block size. Let's take an example: a 6 GB input file with
a 64 MB block size would ideally require ~100 maps. The more of these ~100 maps
you want to run in parallel, the more nodes you need. A 10-node cluster running
10 maps at a time would have to run them in ~10 sequential waves :-(
- ultimately it is the time-vs-cost trade-off that decides the number of nodes.
So for this example, if a map takes at least 2 minutes, the minimum time would
be roughly 2*10 = 20 minutes. Less time would mean more nodes.
- the number of jobs you decide to run at the same time also affects the number
of nodes. Effectively, every individual task (map/reduce) waits in the queue
for the currently executing map/reduce tasks to finish. (Of course, there is
some prioritization support, but it does not help to finish everything in
parallel.)
2. RAM - a general rule of thumb is 1 GB of RAM each for the NameNode,
JobTracker, and Secondary NameNode on the master side. On the slave side, 1 GB
each for the TaskTracker and DataNode leaves practically not much for real
computing on a commodity 8 GB machine. The remaining 5-6 GB can then be used
for MapReduce tasks. So with our example of running 10 maps, each map could use
at most roughly 400-500 MB of heap. Anything beyond this would require either
fewer maps or more RAM.
3. Network speed - Hadoop recommends (I think I read this somewhere; apologies
if not) at least 1 Gb/s (gigabit) networking for the heavy data transfer. My
experience with 100 Mb/s, even in a dev environment, has been disastrous.
4. Hard disk - again a rule of thumb: only about 1/4 of the raw disk capacity
is effectively available for real data. So given a 4 TB disk, effectively only
~1 TB can be used for real data, with ~2 TB used for replication (3 is the
usual replication factor) and ~1 TB for temporary/intermediate usage.
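
To make the arithmetic in points 1, 2 and 4 concrete, here is a minimal
back-of-the-envelope sketch (plain Python, nothing Hadoop-specific; the slot
counts, daemon overheads and temp-space fraction are illustrative assumptions,
not Hadoop defaults):

import math

def map_phase_estimate(input_bytes, block_bytes, nodes, map_slots_per_node,
                       avg_map_minutes):
    """Number of map tasks, scheduling waves, and rough map-phase runtime (point 1)."""
    num_maps = math.ceil(input_bytes / block_bytes)        # roughly one map per block
    waves = math.ceil(num_maps / (nodes * map_slots_per_node))
    return num_maps, waves, waves * avg_map_minutes

def heap_per_map_gb(node_ram_gb, daemon_ram_gb, maps_per_node):
    """RAM left per concurrent map after the TaskTracker/DataNode daemons (point 2)."""
    return (node_ram_gb - daemon_ram_gb) / maps_per_node

def usable_disk_tb(raw_tb, replication=3, temp_fraction=0.25):
    """'Only ~1/4 of the raw disk holds real data' rule of thumb (point 4)."""
    return raw_tb * (1 - temp_fraction) / replication

# The 6 GB / 64 MB example: ~96 maps; with 10 maps running at a time that is
# ~10 waves, i.e. ~20 minutes at ~2 minutes per map.
print(map_phase_estimate(6 * 1024**3, 64 * 1024**2, nodes=10,
                         map_slots_per_node=1, avg_map_minutes=2))  # (96, 10, 20)
print(heap_per_map_gb(8, 2, 10))   # ~0.6 GB of heap per map on an 8 GB slave
print(usable_disk_tb(4))           # ~1.0 TB of real data on a 4 TB disk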
Regards,
Sanjay            
 


-- 
View this message in context: http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Fwd: cost model for MR programs

Posted by bharath vissapragada <bh...@gmail.com>.
Hi all,

Is there any general cost model that can be used to estimate the run time of a
MapReduce program (similar to page I/Os and selectivity factors in an RDBMS) in
terms of configuration aspects such as the number of nodes, page I/Os, etc.?

Thanks.