Posted to common-user@hadoop.apache.org by Matt Bowyer <ma...@googlemail.com> on 2009/05/10 23:30:10 UTC

sub 60 second performance

Hi,

I am trying to do 'on demand map reduce' - something which will return in
reasonable time (a few seconds).

My dataset is relatively small and can fit into my datanode's memory. Is it
possible to keep a block in the datanode's memory so on the next job the
response will be much quicker? The majority of the time spent during the job
run appears to be during the 'HDFS_BYTES_READ' part of the job. I have tried
using setNumTasksToExecutePerJvm, but the block still seems to be cleared
from memory after the job.

thanks!

Re: sub 60 second performance

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon <to...@cloudera.com> wrote:
> In addition to Jason's suggestion, you could also see about setting some of
> Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
> it should be easy to re-load it onto the cluster if it's lost, so even
> putting dfs.data.dir in /dev/shm might be worth trying.
> You'll probably also want mapred.local.dir in /dev/shm
>
> Note that if in fact you don't have enough RAM to do this, you'll start
> swapping and your performance will suck like crazy :)
>
> That said, you may find that even with all storage in RAM your jobs are
> still too slow. Hadoop isn't optimized for this kind of small-job
> performance quite yet. You may find that task setup time dominates the job.
> I think it's entirely reasonable to shoot for sub-60-second jobs down the
> road, and I'd find it interesting to hear what the results are now. Hope you
> report back!
>
> -Todd
>

Also, if your data set is small you can reduce overhead (and parallelism) by
lowering the number of mappers and reducers.

-Dmapred.map.tasks=11
-Dmapred.reduce.tasks=3

Or maybe even go as low as:

-Dmapred.map.tasks=1
-Dmapred.reduce.tasks=1

I use this tactic on jobs with small data sets where the processing time is
much less than the overhead of starting multiple mappers/reducers and
shuffling data.
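For reference, these -D properties go on the job submission command line,
before the job's own arguments, and are picked up via GenericOptionsParser.
The jar, driver class, and paths below are placeholders:

```
hadoop jar my-job.jar MyDriver \
    -Dmapred.map.tasks=1 \
    -Dmapred.reduce.tasks=1 \
    /input/path /output/path
```

Note that mapred.map.tasks is only a hint to the framework; the actual map
count still depends on the InputFormat's splits.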

Re: sub 60 second performance

Posted by Todd Lipcon <to...@cloudera.com>.
In addition to Jason's suggestion, you could also see about setting some of
Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
it should be easy to re-load it onto the cluster if it's lost, so even
putting dfs.data.dir in /dev/shm might be worth trying.
You'll probably also want mapred.local.dir in /dev/shm.
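A sketch of what that might look like in the site configuration, assuming
the /dev/shm subdirectories have already been created (the directory names
here are illustrative):

```
<property>
  <name>dfs.data.dir</name>
  <value>/dev/shm/dfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/dev/shm/mapred/local</value>
</property>
```

Remember that /dev/shm is backed by RAM and cleared on reboot, so anything
stored there must be reloadable from elsewhere.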

Note that if in fact you don't have enough RAM to do this, you'll start
swapping and your performance will suck like crazy :)

That said, you may find that even with all storage in RAM your jobs are
still too slow. Hadoop isn't optimized for this kind of small-job
performance quite yet. You may find that task setup time dominates the job.
I think it's entirely reasonable to shoot for sub-60-second jobs down the
road, and I'd find it interesting to hear what the results are now. Hope you
report back!

-Todd


Re: sub 60 second performance

Posted by jason hadoop <ja...@gmail.com>.
You would need to read the data and store it in an internal data structure,
or copy the file to a local filesystem file and mmap it if you didn't want
to store it in Java heap space.

Your map then has to deal with the fact that the data isn't being passed in
directly. This is not straightforward to do.
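A minimal sketch of the pattern in plain Java, outside Hadoop: a cache held
in a static field is loaded by the first task and stays populated for later
tasks in the same JVM (i.e. when JVM reuse is enabled). All class and method
names here are illustrative, not Hadoop APIs:

```java
import java.util.HashMap;
import java.util.Map;

public class PinnedCache {
    // Pinned: one copy per JVM, survives across tasks when JVMs are reused.
    private static Map<String, String> cache;
    // Counts how many times we paid the "read from HDFS" cost.
    private static int loads = 0;

    // Called at the start of each map task's setup.
    public static synchronized Map<String, String> getCache() {
        if (cache == null) {
            // First task in this JVM: load the data once.
            cache = new HashMap<String, String>();
            cache.put("some-record", "some-value"); // stand-in for real data
            loads++;
        }
        return cache;
    }

    public static int getLoads() {
        return loads;
    }

    public static void main(String[] args) {
        // Simulate two tasks running in the same reused JVM:
        getCache();
        getCache();
        System.out.println("loads = " + getLoads()); // prints "loads = 1"
    }
}
```

The mapper still has to look records up in the cache itself rather than
receiving them from the framework, which is the awkwardness described above.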


On Sun, May 10, 2009 at 4:53 PM, Matt Bowyer <ma...@googlemail.com>wrote:

> Thanks Jason, how can I get access to the particular block?
>
> do you mean create a static map inside the task (add the values).. and
> check
> if populated on the next run?
>
> or is there a more elegant/tried&tested solution?
>
> thanks again
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: sub 60 second performance

Posted by Matt Bowyer <ma...@googlemail.com>.
Thanks Jason, how can I get access to the particular block?

do you mean create a static map inside the task (add the values).. and check
if populated on the next run?

or is there a more elegant/tried&tested solution?

thanks again

On Mon, May 11, 2009 at 12:41 AM, jason hadoop <ja...@gmail.com>wrote:

> You can cache the block in your task, in a pinned static variable, when you
> are reusing the jvms.

Re: sub 60 second performance

Posted by jason hadoop <ja...@gmail.com>.
You can cache the block in your task, in a pinned static variable, when you
are reusing the jvms.
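The JVM-reuse piece is controlled by the mapred.job.reuse.jvm.num.tasks
property, which is what JobConf.setNumTasksToExecutePerJvm sets; -1 means
reuse each JVM for an unlimited number of tasks:

```
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

Without this, each task gets a fresh JVM and any static cache is lost.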
