Posted to common-user@hadoop.apache.org by Jason Rennie <jr...@gmail.com> on 2008/03/11 21:43:15 UTC

performance

Have been working my way through the Map-Reduce tutorial.  Just got the
WordCount example working.  One thing that concerns me is the time it took
to run.  11 seconds is the fastest it's been able to complete after multiple
tries.  I'm investigating Hadoop to distribute a real-time recommendations
system.  I need sub-1-second response times.  Startup time is not so
important.  I'm wondering what's taking so long.  Appears that slowness is
actually in the mapping and reducing (see job output below).  I'm using
java-sun-jdk-1.6.0_04.  Is each task launching its own jvm?  Could that be
the reason for the slowness?

Jason

08/03/11 20:34:55 INFO mapred.FileInputFormat: Total input paths to process
: 2
08/03/11 20:34:55 INFO mapred.JobClient: Running job: job_200803111826_0005
08/03/11 20:34:56 INFO mapred.JobClient:  map 0% reduce 0%
08/03/11 20:35:02 INFO mapred.JobClient:  map 66% reduce 0%
08/03/11 20:35:04 INFO mapred.JobClient:  map 100% reduce 0%
08/03/11 20:35:11 INFO mapred.JobClient:  map 100% reduce 100%
08/03/11 20:35:12 INFO mapred.JobClient: Job complete: job_200803111826_0005
08/03/11 20:35:12 INFO mapred.JobClient: Counters: 12
08/03/11 20:35:12 INFO mapred.JobClient:   Job Counters
08/03/11 20:35:12 INFO mapred.JobClient:     Launched map tasks=3
08/03/11 20:35:12 INFO mapred.JobClient:     Launched reduce tasks=1
08/03/11 20:35:12 INFO mapred.JobClient:     Data-local map tasks=3
08/03/11 20:35:12 INFO mapred.JobClient:   Map-Reduce Framework
08/03/11 20:35:12 INFO mapred.JobClient:     Map input records=2
08/03/11 20:35:12 INFO mapred.JobClient:     Map output records=8
08/03/11 20:35:12 INFO mapred.JobClient:     Map input bytes=50
08/03/11 20:35:12 INFO mapred.JobClient:     Map output bytes=82
08/03/11 20:35:12 INFO mapred.JobClient:     Combine input records=8
08/03/11 20:35:12 INFO mapred.JobClient:     Combine output records=6
08/03/11 20:35:12 INFO mapred.JobClient:     Reduce input groups=5
08/03/11 20:35:12 INFO mapred.JobClient:     Reduce input records=6
08/03/11 20:35:12 INFO mapred.JobClient:     Reduce output records=5


-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: performance

Posted by Ted Dunning <td...@veoh.com>.

Identity reduce is nice because the result values can be sorted.
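
For example, with the WordCount job from the tutorial you could keep the
tokenizing map and just swap in the identity reduce; a rough sketch,
assuming the tutorial's WordCount.Map inner class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class WordCountIdentity {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountIdentity.class);
    conf.setJobName("wordcount-identity");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);      // the tutorial's tokenizing mapper (assumed name)
    conf.setReducerClass(IdentityReducer.class);   // values pass through; output comes out sorted by key
    conf.setInputPath(new Path(args[0]));          // 0.16-era setters; later versions use FileInputFormat
    conf.setOutputPath(new Path(args[1]));
    JobClient.runJob(conf);
  }
}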


On 3/12/08 8:21 AM, "Jason Rennie" <jr...@gmail.com> wrote:

> Map could perform all the dot-products, which is the heavy lifting
> in what we're trying to do.  Might want to do a reduce after that, not
> sure...


Re: performance

Posted by Theodore Van Rooy <mu...@gmail.com>.
I have been using HDFS, setting the block size to some appropriate level
and the replication as well.  When submitting a job, keep in mind that each
block of the file in HDFS will be passed to your mapping script as standard
input.  The data reads will be done locally where possible.  This gives you
a lot of options with regard to your replication and block size settings.
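
One way to pin those down when you load the data, roughly (the property
names are what I believe the current defaults use; double-check
hadoop-default.xml for your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadData {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.block.size", String.valueOf(32L * 1024 * 1024)); // 32 MB blocks -> more splits/maps
    conf.set("dfs.replication", "3");                              // more replicas -> better locality odds
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));    // local source -> HDFS destination
  }
}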

Overall, it's very possible to optimize MapReduce for your specific job;
you just have to know how it does things.  Root around inside the file
system and watch it as it loads up the actual jobs.

Check out the streaming documentation for more ideas on how to optimize
your streaming jobs.

On Wed, Mar 12, 2008 at 9:21 AM, Jason Rennie <jr...@gmail.com> wrote:

> Hmm... sounds promising :)  How do you distribute the data?  Do you use
> HDFS?  Pass the data directly to the individual nodes?  We really only
> need
> to do the map operation like you.  We need to distribute a matrix * vector
> operation, so we want rows of the matrix distributed across different
> nodes.  Map could perform all the dot-products, which is the heavy lifting
> in what we're trying to do.  Might want to do a reduce after that, not
> sure...
>
> Jason
>
> On Tue, Mar 11, 2008 at 6:36 PM, Theodore Van Rooy <mu...@gmail.com>
> wrote:
>
> > There is overhead in grabbing local data, moving it in and out of the
> > system
> > and especially if you are running a map reduce job (like wc) which ends
> up
> > mapping, sorting, copying, reducing, and writing again.
> >
> > One way I've found to get around the overhead is to use Hadoop streaming
> > and
> > perform map only tasks.  While they recommend doing it properly with
> >
> > hstream -mapper /bin/cat -reducer /bin/wc
> >
> > I tried:
> >
> > hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
> > -numReduceTasks 0
> >
> > (hstream is just an alias to do Hadoop streaming)
> >
> > And saw an immediate speedup on a 1 Gig and 10 Gig file.
> >
> > In the end you may have several output files with the wordcount for each
> > file, but adding those files together is pretty quick and easy.
> >
> > My recommendation is to explore how you can get away with either
> > identity reduces, map-only jobs, or no reduces at all.
> >
> > Theo
> >
>
> --
> Jason Rennie
> Head of Machine Learning Technologies, StyleFeeder
> http://www.stylefeeder.com/
> Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
>



-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: performance

Posted by Jason Rennie <jr...@gmail.com>.
Hmm... sounds promising :)  How do you distribute the data?  Do you use
HDFS?  Pass the data directly to the individual nodes?  We really only need
to do the map operation like you.  We need to distribute a matrix * vector
operation, so we want rows of the matrix distributed across different
nodes.  Map could perform all the dot-products, which is the heavy lifting
in what we're trying to do.  Might want to do a reduce after that, not
sure...
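
Roughly what I'm imagining for the map side, sketched with the old mapred
API (one matrix row per input line, "rowId v1 v2 ...", and the dense vector
shipped through the job configuration; class and property names below are
just placeholders):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (rowId, row . vector) for each input line "rowId v1 v2 v3 ...".
public class DotProductMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  private double[] vector;

  public void configure(JobConf job) {
    // The driver is assumed to have done conf.set("dotproduct.vector", "0.1 0.2 ...").
    String[] parts = job.get("dotproduct.vector").split(" ");
    vector = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Double.parseDouble(parts[i]);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, DoubleWritable> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split(" ");
    double dot = 0.0;
    for (int i = 1; i < fields.length; i++) {   // fields[0] is the row id
      dot += Double.parseDouble(fields[i]) * vector[i - 1];
    }
    output.collect(new Text(fields[0]), new DoubleWritable(dot));
  }
}

For a vector of any real size we'd probably ship it via the DistributedCache
or a side file in HDFS rather than the job configuration.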

Jason

On Tue, Mar 11, 2008 at 6:36 PM, Theodore Van Rooy <mu...@gmail.com>
wrote:

> There is overhead in grabbing local data, moving it in and out of the
> system
> and especially if you are running a map reduce job (like wc) which ends up
> mapping, sorting, copying, reducing, and writing again.
>
> One way I've found to get around the overhead is to use Hadoop streaming
> and
> perform map only tasks.  While they recommend doing it properly with
>
> hstream -mapper /bin/cat -reducer /bin/wc
>
> I tried:
>
> hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
> -numReduceTasks 0
>
> (hstream is just an alias to do Hadoop streaming)
>
> And saw an immediate speedup on a 1 Gig and 10 Gig file.
>
> In the end you may have several output files with the wordcount for each
> file, but adding those files together is pretty quick and easy.
>
> My recommendation is to explore how you can get away with either
> identity reduces, map-only jobs, or no reduces at all.
>
> Theo
>

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: performance

Posted by Theodore Van Rooy <mu...@gmail.com>.
There is overhead in grabbing local data, moving it in and out of the system
and especially if you are running a map reduce job (like wc) which ends up
mapping, sorting, copying, reducing, and writing again.

One way I've found to get around the overhead is to use Hadoop streaming and
perform map only tasks.  While they recommend doing it properly with

hstream -mapper /bin/cat -reducer /bin/wc

I tried:

hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
-numReduceTasks 0

(hstream is just an alias to do Hadoop streaming)

And saw an immediate speedup on a 1 Gig and 10 Gig file.

In the end you may have several output files with the wordcount for each
file, but adding those files together is pretty quick and easy.

My recommendation is to explore how you can get away with either identity
reduces, map-only jobs, or no reduces at all.
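
For a regular Java (non-streaming) job, the -numReduceTasks 0 trick is just
one call on the JobConf.  A minimal sketch (driver and mapper names here are
placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setMapperClass(MyMapper.class);    // placeholder for your map logic
    conf.setNumReduceTasks(0);              // map-only: no sort, shuffle, or reduce phase
    conf.setInputPath(new Path(args[0]));   // 0.16-era setters; later versions use FileInputFormat
    conf.setOutputPath(new Path(args[1]));
    JobClient.runJob(conf);                 // each map writes its output file directly
  }
}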

Theo

On Tue, Mar 11, 2008 at 4:21 PM, Jason Rennie <jr...@gmail.com> wrote:

> On Tue, Mar 11, 2008 at 5:18 PM, Ted Dunning <td...@veoh.com> wrote:
>
> > Yes.  Each task is launching a JVM.
>
>
> Guess that would explain the slowness :)  Is HDFS tuned similarly?  We're
> thinking of possibly distributing our data using HDFS but storing a
> sufficiently small amount of data per node so that the linux kernel could
> buffer it all into memory.  Is there much overhead in grabbing data from
> HDFS if that data is stored locally?
>
> > Map reduce is not generally useful for real-time applications.  It is VERY
> > useful for large scale data reductions done in advance of real-time
> > operations.
> >
> > The basic issue is that the major performance contribution of map-reduce
> > architectures is large scale sequential access of data stores.  That is
> > pretty much in contradiction with real-time response.
> >
>
> Gotcha.  We'll consider switching to a batch-style approach, which it
> sounds
> like Hadoop would be perfect for.
>
> Thanks,
>
> Jason
>



-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: performance

Posted by Ted Dunning <td...@veoh.com>.


On 3/11/08 3:21 PM, "Jason Rennie" <jr...@gmail.com> wrote:

> Is there much overhead in grabbing data from
> HDFS if that data is stored locally?

That should be pretty good.

>> Map reduce is not generally useful for real-time applications.  It is VERY
>> useful for large scale data reductions done in advance of real-time
>> operations.
> 
> Gotcha.  We'll consider switching to a batch-style approach, which it sounds
> like Hadoop would be perfect for.

I couldn't possibly comment in detail, but I would have to say that this
seems like a very good approach.

:-)


Re: performance

Posted by Jason Rennie <jr...@gmail.com>.
On Tue, Mar 11, 2008 at 5:18 PM, Ted Dunning <td...@veoh.com> wrote:

> Yes.  Each task is launching a JVM.


Guess that would explain the slowness :)  Is HDFS tuned similarly?  We're
thinking of possibly distributing our data using HDFS but storing a
sufficiently small amount of data per node so that the linux kernel could
buffer it all into memory.  Is there much overhead in grabbing data from
HDFS if that data is stored locally?

> Map reduce is not generally useful for real-time applications.  It is VERY
> useful for large scale data reductions done in advance of real-time
> operations.
>
> The basic issue is that the major performance contribution of map-reduce
> architectures is large scale sequential access of data stores.  That is
> pretty much in contradiction with real-time response.
>

Gotcha.  We'll consider switching to a batch-style approach, which it sounds
like Hadoop would be perfect for.

Thanks,

Jason

Re: performance

Posted by Ted Dunning <td...@veoh.com>.
Would you be interested in the grool extension to Groovy described in the
attached README?

I am looking for early collaborators/guinea pigs.


On 3/11/08 1:43 PM, "Jason Rennie" <jr...@gmail.com> wrote:

> Have been working my way through the Map-Reduce tutorial.  Just got the
> WordCount example working.  One thing that concerns me is the time it took
> to run.  11 seconds is the fastest it's been able to complete after multiple
> tries.  I'm investigating Hadoop to distribute a real-time recommendations
> system.  I need sub-1-second response times.  Startup time is not so
> important.  I'm wondering what's taking so long.  Appears that slowness is
> actually in the mapping and reducing (see job output below).  I'm using
> java-sun-jdk-1.6.0_04.  Is each task launching its own jvm?  Could that be
> the reason for the slowness?
> 
> Jason
> 
> 08/03/11 20:34:55 INFO mapred.FileInputFormat: Total input paths to process
> : 2
> 08/03/11 20:34:55 INFO mapred.JobClient: Running job: job_200803111826_0005
> 08/03/11 20:34:56 INFO mapred.JobClient:  map 0% reduce 0%
> 08/03/11 20:35:02 INFO mapred.JobClient:  map 66% reduce 0%
> 08/03/11 20:35:04 INFO mapred.JobClient:  map 100% reduce 0%
> 08/03/11 20:35:11 INFO mapred.JobClient:  map 100% reduce 100%
> 08/03/11 20:35:12 INFO mapred.JobClient: Job complete: job_200803111826_0005
> 08/03/11 20:35:12 INFO mapred.JobClient: Counters: 12
> 08/03/11 20:35:12 INFO mapred.JobClient:   Job Counters
> 08/03/11 20:35:12 INFO mapred.JobClient:     Launched map tasks=3
> 08/03/11 20:35:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 08/03/11 20:35:12 INFO mapred.JobClient:     Data-local map tasks=3
> 08/03/11 20:35:12 INFO mapred.JobClient:   Map-Reduce Framework
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map input records=2
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map output records=8
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map input bytes=50
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map output bytes=82
> 08/03/11 20:35:12 INFO mapred.JobClient:     Combine input records=8
> 08/03/11 20:35:12 INFO mapred.JobClient:     Combine output records=6
> 08/03/11 20:35:12 INFO mapred.JobClient:     Reduce input groups=5
> 08/03/11 20:35:12 INFO mapred.JobClient:     Reduce input records=6
> 08/03/11 20:35:12 INFO mapred.JobClient:     Reduce output records=5
> 


Re: performance

Posted by Ted Dunning <td...@veoh.com>.
Yes.  Each task is launching a JVM.

Map reduce is not generally useful for real-time applications.  It is VERY
useful for large scale data reductions done in advance of real-time
operations.

The basic issue is that the major performance contribution of map-reduce
architectures is large scale sequential access of data stores.  That is
pretty much in contradiction with real-time response.


On 3/11/08 1:43 PM, "Jason Rennie" <jr...@gmail.com> wrote:

> Have been working my way through the Map-Reduce tutorial.  Just got the
> WordCount example working.  One thing that concerns me is the time it took
> to run.  11 seconds is the fastest it's been able to complete after multiple
> tries.  I'm investigating Hadoop to distribute a real-time recommendations
> system.  I need sub-1-second response times.  Startup time is not so
> important.  I'm wondering what's taking so long.  Appears that slowness is
> actually in the mapping and reducing (see job output below).  I'm using
> java-sun-jdk-1.6.0_04.  Is each task launching its own jvm?  Could that be
> the reason for the slowness?
> 
> Jason
> 
> 08/03/11 20:34:55 INFO mapred.FileInputFormat: Total input paths to process
> : 2
> 08/03/11 20:34:55 INFO mapred.JobClient: Running job: job_200803111826_0005
> 08/03/11 20:34:56 INFO mapred.JobClient:  map 0% reduce 0%
> 08/03/11 20:35:02 INFO mapred.JobClient:  map 66% reduce 0%
> 08/03/11 20:35:04 INFO mapred.JobClient:  map 100% reduce 0%
> 08/03/11 20:35:11 INFO mapred.JobClient:  map 100% reduce 100%
> 08/03/11 20:35:12 INFO mapred.JobClient: Job complete: job_200803111826_0005
> 08/03/11 20:35:12 INFO mapred.JobClient: Counters: 12
> 08/03/11 20:35:12 INFO mapred.JobClient:   Job Counters
> 08/03/11 20:35:12 INFO mapred.JobClient:     Launched map tasks=3
> 08/03/11 20:35:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 08/03/11 20:35:12 INFO mapred.JobClient:     Data-local map tasks=3
> 08/03/11 20:35:12 INFO mapred.JobClient:   Map-Reduce Framework
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map input records=2
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map output records=8
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map input bytes=50
> 08/03/11 20:35:12 INFO mapred.JobClient:     Map output bytes=82
> 08/03/11 20:35:12 INFO mapred.JobClient:     Combine input records=8
> 08/03/11 20:35:12 INFO mapred.JobClient:     Combine output records=6
> 08/03/11 20:35:12 INFO mapred.JobClient:     Reduce input groups=5
> 08/03/11 20:35:12 INFO mapred.JobClient:     Reduce input records=6
> 08/03/11 20:35:12 INFO mapred.JobClient:     Reduce output records=5
>