Posted to common-user@hadoop.apache.org by Amit Kumar Singh <am...@cse.iitb.ac.in> on 2008/09/05 09:59:38 UTC

Sharing Memory across Map tasks [multiple cores] running in same machine

Can we use something like RAM FS to share static data across map tasks?

Scenario:
1) Quad-core machine
2) 2 x 1-TB disks
3) 8 GB RAM

I need ~2.7 GB of RAM per map process to load some static data into memory,
which I would then use to process the data (CPU-intensive jobs).

Can I share memory across mappers on the same machine, so that the memory
footprint is smaller and I can run more than 4 mappers simultaneously,
utilizing all 4 cores?

Can we use stuff like ramfs?


Hadoop custom readers and writers

Posted by Amit Simgh <am...@cse.iitb.ac.in>.
Hi,

I have thousands of webpages, each represented as a serialized tree object,
compressed (zlib) together (file sizes varying from 2.5 GB to 4.5 GB).
I have to do some heavy text processing on these pages.

What is the best way to read/access these pages?

Method 1
***************
1) Write a custom splitter that:
    1. uncompresses the file (2.5 GB to 4 GB) and then parses it (time:
around 10 minutes)
    2. splits the binary data into 10-20 parts
2) Implement specific readers to read a page and present it to the mapper

OR

Method 2
***************
Read the entire file without splitting: one map task per file.
Implement specific readers to read a page and present it to the mapper (see
the sketch below).
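
If Method 2 is the route, one common trick is an input format that refuses
to split, so getSplits() hands each file to exactly one map task. A minimal
sketch against the 0.17-era org.apache.hadoop.mapred API (the class name is
mine; a real version would pair it with a RecordReader that zlib-decompresses
the stream and emits one deserialized page tree per record):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical: with isSplitable() returning false, getSplits() produces
// one split per file, i.e. one map task per compressed file (Method 2).
public class NonSplittableInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}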

Slight detour:
I was browsing through the code in FileInputFormat and TextInputFormat. In
the getSplits() method the file is broken at arbitrary byte boundaries. So
in the case of TextInputFormat, what happens if the last line of a split is
truncated (an incomplete byte sequence)?
Can someone explain and give pointers to where this happens in the code?
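
As far as I can tell, the convention Hadoop's line reader follows is: a
reader whose split does not start at offset 0 discards everything up to the
first newline (the previous split's reader finishes that line), and every
reader reads the line straddling its end in full, so each line is consumed
exactly once. A self-contained illustration of that rule (hypothetical
names, plain Java, not the actual Hadoop source):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of reading line records from a split [start, end)
// that was cut at an arbitrary byte boundary.
public class SplitLineReader {

  public static List<String> readSplit(String file, long start, long end)
      throws IOException {
    List<String> lines = new ArrayList<String>();
    RandomAccessFile in = new RandomAccessFile(file, "r");
    try {
      if (start == 0) {
        in.seek(0);
      } else {
        // Back up one byte and discard through the next newline: this skips
        // the partial first line, and the one-byte backup handles the corner
        // case where the split begins exactly at the start of a line.
        in.seek(start - 1);
        in.readLine();
      }
      // Read every line that *starts* before 'end'; the line straddling the
      // boundary is read in full, past the nominal end of the split.
      while (in.getFilePointer() < end) {
        String line = in.readLine();
        if (line == null) {
          break; // end of file
        }
        lines.add(line);
      }
    } finally {
      in.close();
    }
    return lines;
  }
}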

I also saw classes like Records. What are these used for?



Re: Sharing Memory across Map tasks [multiple cores] running in same machine

Posted by Owen O'Malley <om...@apache.org>.
On Fri, Sep 5, 2008 at 12:59 AM, Amit Kumar Singh
<am...@cse.iitb.ac.in> wrote:

> Can we use something like RAM FS to share static data across map tasks?


As others have said, this won't work right. You should probably look at
MultithreadedMapRunner<http://hadoop.apache.org/core/docs/r0.17.2/api/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.html>,
which uses a thread pool to process the inputs. It is typically used for
crawling or other map methods that take a long time per record. If you have
substantial work inside the map, you can saturate the CPUs that way. Of course
the downside is that you have a single RecordReader feeding you inputs, so
you are limited by the reading speed of a single HDFS client.
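
Wiring that in is a couple of lines on the JobConf. A minimal sketch against
the 0.17-era API (the class name is a placeholder; the thread-count property
is, as far as I recall, the one MultithreadedMapRunner reads):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedSetup {
  public static void configure(JobConf conf) {
    // Swap the default MapRunner for the thread-pooled one.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Threads per map task; one per core on the quad-core box above.
    conf.setInt("mapred.map.multithreadedrunner.threads", 4);
  }
}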

-- Owen



Re: Sharing Memory across Map tasks [multiple cores] running in same machine

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Well, a classical solution to that on Linux would be to mmap a cache file into
multiple processes. No idea whether you can do something like that with Java.
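
(For what it's worth, Java can get close to this via NIO: FileChannel.map()
returns a MappedByteBuffer, and read-only mappings of the same file are
backed by the same physical pages across processes. A minimal sketch; the
/dev/shm path is a hypothetical tmpfs-resident cache file:)

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedStaticData {

  public static MappedByteBuffer map(String path) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "r");
    try {
      FileChannel channel = file.getChannel();
      // A READ_ONLY mapping lives in the OS page cache, so every JVM on the
      // machine that maps the same file shares one physical copy. Note that
      // a single mapping is capped at Integer.MAX_VALUE bytes (~2 GB), so
      // ~2.7 GB of static data would need to be split across two mappings.
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    } finally {
      file.close(); // the mapping stays valid after the channel is closed
    }
  }

  public static void main(String[] args) throws IOException {
    MappedByteBuffer data = map("/dev/shm/static-data.bin"); // hypothetical
    System.out.println("Mapped " + data.capacity() + " bytes");
  }
}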

Andreas

On Friday 05 September 2008 10:28:37 Devaraj Das wrote:
> Hadoop doesn't support this natively, so if you need this kind of
> functionality, you'd need to code your application accordingly. But I am
> worried about the race conditions in determining which task should first
> create the ramfs and load the data.
> If you can provide atomicity in determining whether the ramfs has been
> created and the data loaded, and, if not, then do the creation/load, things
> should work.
> If atomicity cannot be guaranteed, you might consider this:
> 1) Run a job with only maps that creates the ramfs and loads the data (if
> your cluster is small you can do this manually). You can use the distributed
> cache to store the data you want to load.
> 2) Run your job that processes the data.
> 3) Run a third job to delete the ramfs.
>
> On 9/5/08 1:29 PM, "Amit Kumar Singh" <am...@cse.iitb.ac.in> wrote:
> > Can we use something like RAM FS to share static data across map tasks?
> >
> > Scenario:
> > 1) Quad-core machine
> > 2) 2 x 1-TB disks
> > 3) 8 GB RAM
> >
> > I need ~2.7 GB of RAM per map process to load some static data into
> > memory, which I would then use to process the data (CPU-intensive jobs).
> >
> > Can I share memory across mappers on the same machine, so that the memory
> > footprint is smaller and I can run more than 4 mappers simultaneously,
> > utilizing all 4 cores?
> >
> > Can we use stuff like ramfs?



Re: Sharing Memory across Map tasks [multiple cores] running in same machine

Posted by Devaraj Das <dd...@yahoo-inc.com>.
Hadoop doesn't support this natively, so if you need this kind of
functionality, you'd need to code your application accordingly. But I am
worried about the race conditions in determining which task should first
create the ramfs and load the data.
If you can provide atomicity in determining whether the ramfs has been
created and the data loaded, and, if not, then do the creation/load, things
should work.
If atomicity cannot be guaranteed, you might consider this:
1) Run a job with only maps that creates the ramfs and loads the data (if
your cluster is small you can do this manually). You can use the distributed
cache to store the data you want to load (see the sketch after this list).
2) Run your job that processes the data.
3) Run a third job to delete the ramfs.
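
A minimal sketch of the distributed-cache part of step 1, against the
0.17-era API (the class name and the HDFS path are placeholders):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {

  // At submission time: register the static-data file so the framework
  // copies it onto every tasktracker's local disk before the maps start.
  public static void addStaticData(JobConf conf) throws Exception {
    // Placeholder HDFS path; point this at the real static-data file.
    DistributedCache.addCacheFile(new URI("/data/static-data.bin"), conf);
  }

  // Inside a task (or the map-only loader job): locate the local copy,
  // e.g. to copy it into the ramfs.
  public static Path localCopy(JobConf conf) throws Exception {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    return localFiles[0];
  }
}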


On 9/5/08 1:29 PM, "Amit Kumar Singh" <am...@cse.iitb.ac.in> wrote:

> Can we use something like RAM FS to share static data across map tasks?
>
> Scenario:
> 1) Quad-core machine
> 2) 2 x 1-TB disks
> 3) 8 GB RAM
>
> I need ~2.7 GB of RAM per map process to load some static data into
> memory, which I would then use to process the data (CPU-intensive jobs).
>
> Can I share memory across mappers on the same machine, so that the memory
> footprint is smaller and I can run more than 4 mappers simultaneously,
> utilizing all 4 cores?
>
> Can we use stuff like ramfs?
>