Posted to common-user@hadoop.apache.org by Devaraj Das <dd...@yahoo-inc.com> on 2008/09/05 10:28:37 UTC

Re: Sharing Memory across Map tasks [multiple cores] running in same machine

Hadoop doesn't support this natively, so if you need this kind of
functionality you'd have to build it into your application yourself. But I am
worried about the race conditions in deciding which task should be the one to
create the ramfs and load the data.
If you can atomically check whether the ramfs has been created and the data
loaded, and perform the creation/load only when it hasn't, then things should
work.
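Something along these lines might work for the atomic check. This is an
untested sketch; the paths, and the assumption that a ramfs/tmpfs is already
mounted at /mnt/ramfs on every node, are made up:

import java.io.File;

public class RamfsGuard {
  // Hypothetical paths; assumes the admin has already mounted a
  // ramfs/tmpfs at /mnt/ramfs on each node.
  private static final File LOCK = new File("/mnt/ramfs/LOCK");
  private static final File DONE = new File("/mnt/ramfs/DONE");

  // Returns once the static data is known to be loaded.
  public static void ensureLoaded() throws Exception {
    // createNewFile() is atomic on the local fs: exactly one task wins.
    if (LOCK.createNewFile()) {
      loadStaticData();          // winner populates the ramfs...
      DONE.createNewFile();      // ...then publishes a "done" marker
    } else {
      while (!DONE.exists()) {   // losers wait for the winner to finish
        Thread.sleep(1000);
      }
    }
  }

  private static void loadStaticData() {
    // copy the ~2.7 GB of static data into /mnt/ramfs here
  }
}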
If atomicity cannot be guaranteed, you might consider this:
1) Run a map-only job that creates the ramfs and loads the data (if your
cluster is small you can do this manually). You can use the distributed cache
to ship the data you want to load; see the sketch after this list.
2) Run the job that processes the data.
3) Run a third job to delete the ramfs.
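For step 1, a rough, untested sketch of the loader job (the mapper class,
input path, and HDFS file name are all made up, and arranging exactly one map
per node is the fiddly part):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LoaderJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LoaderJob.class);
    conf.setJobName("ramfs-loader");
    conf.setNumReduceTasks(0);                      // map-only job
    // Ship the static data to every node via the distributed cache.
    DistributedCache.addCacheFile(new URI("hdfs:///data/static.bin"), conf);
    // Dummy input; each map just copies the cached file into the ramfs.
    FileInputFormat.setInputPaths(conf, new Path("/tmp/loader-input"));
    conf.setMapperClass(RamfsLoaderMapper.class);   // hypothetical mapper
    JobClient.runJob(conf);
  }
}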


On 9/5/08 1:29 PM, "Amit Kumar Singh" <am...@cse.iitb.ac.in> wrote:

> Can we use something like ramfs to share static data across map tasks?
> 
> Scenario:
> 1) Quad-core machine
> 2) Two 1-TB disks
> 3) 8 GB RAM
> 
> Now I need ~2.7 GB of RAM per map process to load some static data into
> memory, which I would then use to process the input (CPU-intensive jobs).
> 
> Can I share memory across mappers on the same machine, so that the memory
> footprint is smaller and I can run more mappers simultaneously, utilizing
> all 4 cores? (At ~2.7 GB per copy, four independent copies would need
> ~10.8 GB, more than the 8 GB available.)
> 
> Can we use something like ramfs?
> 



Re: Sharing Memory across Map tasks [multiple cores] running in same machine

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Well, a classical solution to that on Linux would be to mmap a cache file into
multiple processes. Java can do this too, via java.nio memory-mapped files.
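A rough, untested sketch (the file path is made up, and note that a single
MappedByteBuffer tops out at 2 GB):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapExample {
  public static void main(String[] args) throws Exception {
    RandomAccessFile file =
        new RandomAccessFile("/mnt/ramfs/static.bin", "r"); // made-up path
    FileChannel ch = file.getChannel();
    // One MappedByteBuffer can cover at most 2 GB, so a ~2.7 GB file
    // would have to be mapped as two or more chunks.
    long len = Math.min(ch.size(), Integer.MAX_VALUE);
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
    // Every JVM that maps the same file shares the same physical pages,
    // so the data sits in memory once, not once per mapper.
    System.out.println("first byte: " + buf.get(0));
  }
}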

Andreas

On Friday 05 September 2008 10:28:37 Devaraj Das wrote:
> Hadoop doesn't support this natively, so if you need this kind of
> functionality you'd have to build it into your application yourself.
> [...]