Posted to common-user@hadoop.apache.org by Scott Carey <sc...@richrelevance.com> on 2009/03/01 20:21:11 UTC

RE: MapReduce jobs with expensive initialization

You could create a singleton class and keep the dictionary data in it.  You would probably want this separate from your other classes, so as to control exactly which data is held for a long time and which is not.

class Singleton {

    private static final Singleton _instance = new Singleton();

    private Singleton() {
        // ... initialize here; only ever called once per classloader or JVM
    }

    public static Singleton getSingleton() {
        return _instance;
    }
}

in mapper:

Singleton dictionary = Singleton.getSingleton();

This assumes that each mapper doesn't live in its own classloader space (which would make even static singletons not shareable), and has the drawback that, once initialized, the memory associated with the singleton won't go away until the JVM or classloader that hosts it dies.

I have not tried this myself, and do not know the exact classloader semantics used in the new 'persistent' task JVMs.  They could have a classloader per job and dispose of it when the job is complete, in which case data could persist within a job but not across jobs.  Or there could be one permanent classloader, or one per task.  Each would behave differently with respect to statics like the above example.
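
To make the idea concrete without needing Hadoop on the classpath, here is a minimal sketch. DictionaryHolder, FakeMapper and ReuseDemo are hypothetical names for illustration; FakeMapper only stands in for a real Mapper whose configure() runs once per task. It shows the load running once while several "tasks" are created in the same JVM and classloader, which is exactly the assumption discussed above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the expensive data (illustrative name and contents).
final class DictionaryHolder {
    // Counts how often the expensive load actually ran.
    // Declared before INSTANCE so it is zeroed before the constructor runs.
    static int loadCount = 0;

    // Created once, when this class is initialized by its classloader.
    private static final DictionaryHolder INSTANCE = new DictionaryHolder();

    private final Map<String, String> entries = new HashMap<>();

    private DictionaryHolder() {
        loadCount++;
        // Stand-in for the expensive dictionary load.
        entries.put("hadoop", "elephant");
    }

    static DictionaryHolder getInstance() {
        return INSTANCE;
    }

    String lookup(String key) {
        return entries.get(key);
    }
}

// Stand-in for a Hadoop Mapper: one new instance per task,
// with configure() called on each instance.
class FakeMapper {
    private DictionaryHolder dictionary;

    void configure() {
        // Cheap after the first call: it only returns the existing instance.
        dictionary = DictionaryHolder.getInstance();
    }

    String map(String key) {
        return dictionary.lookup(key);
    }
}

class ReuseDemo {
    public static void main(String[] args) {
        // Simulate three tasks running one after another in the same JVM.
        for (int task = 0; task < 3; task++) {
            FakeMapper mapper = new FakeMapper();
            mapper.configure();
            System.out.println("task " + task + " -> " + mapper.map("hadoop"));
        }
        System.out.println("loads: " + DictionaryHolder.loadCount);
    }
}
```

If each task were given its own classloader, each one would get its own copy of DictionaryHolder and the load would run again; this sketch cannot show which of those cases the task JVMs fall into.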

________________________________________
From: Stuart White [stuart.white1@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading
of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm(-1) to
force mapreduce to reuse JVMs when executing tasks.

How I've been performing my initialization is, in my mapper, I
override MapReduceBase#configure, read my parms from the JobConf, and
load my dictionaries.

It appears, from the tests I've run, that even though
NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
are being created for each task, and therefore I'm still re-running
this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive
initialization per-task?  Should I move my initialization code out of
my mapper class and into my "main" class?  If so, how do I pass
references to the loaded dictionaries from my main class to my mapper?

Thanks!

Re: Setting ctime in HDFS

Posted by Rasit OZDAS <ra...@gmail.com>.

Cosmin, unfortunately there isn't such a method yet (in FileSystem api).

Rasit

2009/3/6 Cosmin Lehene <cl...@adobe.com>

> Hi,
>
> Is there any way to create a file in HDFS and set the creation date(ctime)
> in the file attributes?
>
> Thanks,
> Cosmin
>

Re: MapReduce jobs with expensive initialization

Posted by jason hadoop <ja...@gmail.com>.

You can have your item in a separate jar and pin the reference so that it
lives in perm-gen and is not unloaded. Then you can search the class loader
hierarchy for the reference.
A quick scan through the Child.java main loop shows no magic with class
loaders.

I wrote some code to check this against 0.19.0. Very clearly the JVM is
doing nothing special with class loaders.
The classes are loaded exactly once.
The job counters contain details of how many times the map method was called
and the number of times the singleton was taken, including the JVM PIDs.

I wrote a mapper and a singleton; the singleton has one method that returns
the number of times that getSingleton was called.
The code fragment is from the examples in my book:
http://www.apress.com/book/view/9781430219422
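
For readers without the book, a rough sketch of a singleton of that kind could look like the following. This is not the code from the book: CountingSingleton and its methods are made-up names, and the JVM identifier comes from the standard RuntimeMXBean, whose name is commonly of the form "pid@hostname" but is not guaranteed to be. In a real job the values would be reported through job counters, which this sketch leaves out.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical singleton that records how often it is handed out.
final class CountingSingleton {
    // Thread-safe counter of getSingleton() calls in this classloader.
    private static final AtomicInteger ACCESSES = new AtomicInteger();

    private static final CountingSingleton INSTANCE = new CountingSingleton();

    // Identifies the hosting JVM; often "pid@hostname", format not guaranteed.
    private final String jvmName =
            ManagementFactory.getRuntimeMXBean().getName();

    private CountingSingleton() {
        // Expensive initialization would go here.
    }

    static CountingSingleton getSingleton() {
        ACCESSES.incrementAndGet();
        return INSTANCE;
    }

    // Number of times getSingleton() has been called so far.
    static int accessCount() {
        return ACCESSES.get();
    }

    String jvmName() {
        return jvmName;
    }
}
```

If the same JVM name shows up with an access count that keeps growing across tasks, the class was loaded once and reused; a count that restarts for every task would point to reloading.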

On Fri, Mar 6, 2009 at 12:55 PM, Scott Carey <sc...@richrelevance.com> wrote:

> One further thought on this, the mapper jvm may be loading the jar and
> overwriting / throwing away all previous class descriptions from the
> previous map job, which will remove the statics and reinitialize.  In this
> case, the singleton won't work if it is in the job jar.  What will work is
> putting the singleton in a global classpath (shared library not in the job
> jar).
>
>
> On 3/6/09 12:46 PM, "Scott Carey" <sc...@richrelevance.com> wrote:
>
> The difference is that if the whole mapper class itself is being reloaded
> somehow (instantiated by reflection and then de-referenced and gc'd?) the
> static won't work the way you expect.  Not knowing how that works, and
> assuming that statics there don't work, a singleton in another class may
> still work.  The singleton class is certainly not being instantiated by
> reflection so (I believe) only a classloader closing will get rid of it.
>
> At least, it's worth a try, since unlike the mapper class, you control how
> it is instantiated.  So the two cases are not the same.
>
>
> On 3/6/09 12:13 AM, "Rasit OZDAS" <ra...@gmail.com> wrote:
>
> Owen, I tried this; it doesn't work.
> I doubt the static singleton method will work either,
> since it's more or less the same.
>
> Rasit
>
> 2009/3/2 Owen O'Malley <om...@apache.org>
>
> >
> > On Mar 2, 2009, at 3:03 AM, Tom White wrote:
> >
> >  I believe the static singleton approach outlined by Scott will work
> >> since the map classes are in a single classloader (but I haven't
> >> actually tried this).
> >>
> >
> > Even easier, you should just be able to do it with static initialization in
> > the Mapper class. (I haven't tried it either... )
> >
> > -- Owen
> >
>
>
>
> --
> M. Raşit ÖZDAŞ
>
>
>

Re: MapReduce jobs with expensive initialization

Posted by Scott Carey <sc...@richrelevance.com>.

One further thought on this, the mapper jvm may be loading the jar and overwriting / throwing away all previous class descriptions from the previous map job, which will remove the statics and reinitialize.  In this case, the singleton won’t work if it is in the job jar.  What will work is putting the singleton in a global classpath (shared library not in the job jar).

On 3/6/09 12:46 PM, "Scott Carey" <sc...@richrelevance.com> wrote:

The difference is that if the whole mapper class itself is being reloaded somehow (instantiated by reflection and then de-referenced and gc’d?) the static won’t work the way you expect.  Not knowing how that works, and assuming that statics there don’t work, a singleton in another class may still work.  The singleton class is certainly not being instantiated by reflection so (I believe) only a classloader closing will get rid of it.

At least, it's worth a try, since unlike the mapper class, you control how it is instantiated.  So the two cases are not the same.

On 3/6/09 12:13 AM, "Rasit OZDAS" <ra...@gmail.com> wrote:

Owen, I tried this; it doesn't work.
I doubt the static singleton method will work either,
since it's more or less the same.

Rasit

2009/3/2 Owen O'Malley <om...@apache.org>

>
> On Mar 2, 2009, at 3:03 AM, Tom White wrote:
>
>  I believe the static singleton approach outlined by Scott will work
>> since the map classes are in a single classloader (but I haven't
>> actually tried this).
>>
>
> Even easier, you should just be able to do it with static initialization in
> the Mapper class. (I haven't tried it either... )
>
> -- Owen
>

--
M. Raşit ÖZDAŞ

Re: MapReduce jobs with expensive initialization

Posted by Scott Carey <sc...@richrelevance.com>.

The difference is that if the whole mapper class itself is being reloaded somehow (instantiated by reflection and then de-referenced and gc’d?) the static won’t work the way you expect.  Not knowing how that works, and assuming that statics there don’t work, a singleton in another class may still work.  The singleton class is certainly not being instantiated by reflection so (I believe) only a classloader closing will get rid of it.

At least, it's worth a try, since unlike the mapper class, you control how it is instantiated.  So the two cases are not the same.

On 3/6/09 12:13 AM, "Rasit OZDAS" <ra...@gmail.com> wrote:

Owen, I tried this; it doesn't work.
I doubt the static singleton method will work either,
since it's more or less the same.

Rasit

2009/3/2 Owen O'Malley <om...@apache.org>

>
> On Mar 2, 2009, at 3:03 AM, Tom White wrote:
>
>  I believe the static singleton approach outlined by Scott will work
>> since the map classes are in a single classloader (but I haven't
>> actually tried this).
>>
>
> Even easier, you should just be able to do it with static initialization in
> the Mapper class. (I haven't tried it either... )
>
> -- Owen
>

--
M. Raşit ÖZDAŞ

Setting ctime in HDFS

Posted by Cosmin Lehene <cl...@adobe.com>.

Hi,

Is there any way to create a file in HDFS and set the creation date(ctime)
in the file attributes?

Thanks,
Cosmin

Re: MapReduce jobs with expensive initialization

Posted by Rasit OZDAS <ra...@gmail.com>.

Owen, I tried this; it doesn't work.
I doubt the static singleton method will work either,
since it's more or less the same.

Rasit

2009/3/2 Owen O'Malley <om...@apache.org>

>
> On Mar 2, 2009, at 3:03 AM, Tom White wrote:
>
>  I believe the static singleton approach outlined by Scott will work
>> since the map classes are in a single classloader (but I haven't
>> actually tried this).
>>
>
> Even easier, you should just be able to do it with static initialization in
> the Mapper class. (I haven't tried it either... )
>
> -- Owen
>



-- 
M. Raşit ÖZDAŞ

Re: MapReduce jobs with expensive initialization

Posted by Owen O'Malley <om...@apache.org>.

On Mar 2, 2009, at 3:03 AM, Tom White wrote:

> I believe the static singleton approach outlined by Scott will work
> since the map classes are in a single classloader (but I haven't
> actually tried this).

Even easier, you should just be able to do it with static  
initialization in the Mapper class. (I haven't tried it either... )

-- Owen
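
As a rough illustration of static initialization in the mapper class itself, here is a sketch that does not depend on Hadoop. StaticInitMapper, its dictionary contents and the initRuns counter are hypothetical and exist only for this example. Whether the static state survives from one task to the next depends on how the task JVM loads classes, which is the open question in this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical mapper-like class; a real one would implement Hadoop's Mapper.
class StaticInitMapper {
    // Counts how many times the static initializer has run.
    static int initRuns = 0;

    // Shared by every instance created by the same classloader.
    static final Map<String, String> DICTIONARY;

    static {
        // Runs once, when the class is first initialized.
        Map<String, String> loaded = new HashMap<>();
        loaded.put("hdfs", "filesystem"); // stand-in for the expensive load
        DICTIONARY = Collections.unmodifiableMap(loaded);
        initRuns++;
    }

    String map(String key) {
        return DICTIONARY.get(key);
    }
}
```

Creating several StaticInitMapper instances in one JVM and classloader leaves initRuns at 1, because the static block belongs to the class and not to its instances.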

Re: MapReduce jobs with expensive initialization

Posted by Tom White <to...@cloudera.com>.

On any particular tasktracker slot, task JVMs are shared only between
tasks of the same job. When the job is complete the task JVM will go
away. So there is certainly no sharing between jobs.

I believe the static singleton approach outlined by Scott will work
since the map classes are in a single classloader (but I haven't
actually tried this).

Cheers,
Tom

On Mon, Mar 2, 2009 at 1:39 AM, jason hadoop <ja...@gmail.com> wrote:
> If you have to, you can reach through all of the class loaders and find the
> instance of your singleton class that has the data loaded. It is awkward, and
> I haven't done this in Java since the late 90's. It did work the last time I
> did it.
>
>
> On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey <sc...@richrelevance.com> wrote:
>
>> You could create a singleton class and reference the dictionary stuff in
>> that.  You would probably want this separate from other classes as to
>> control exactly what data is held on to for a long time and what is not.
>>
>> class Singleton {
>>
>>     private static final Singleton _instance = new Singleton();
>>
>>     private Singleton() {
>>         // ... initialize here; only ever called once per classloader or JVM
>>     }
>>
>>     public static Singleton getSingleton() {
>>         return _instance;
>>     }
>> }
>>
>> in mapper:
>>
>> Singleton dictionary = Singleton.getSingleton();
>>
>> This assumes that each mapper doesn't live in its own classloader space
>> (which would make even static singletons not shareable), and has the
>> drawback that once initialized, that memory associated with the singleton
>> won't go away until the JVM or classloader that hosts it dies.
>>
>> I have not tried this myself, and do not know the exact classloader
>> semantics used in the new 'persistent' task JVMs.  They could have a
>> classloader per job, and dispose of those when the job is complete -- though
>> then it is impossible to persist data across jobs but only within them.  Or
>> there could be one permanent persisted classloader, or one per task.   All
>> will behave differently with respect to statics like the above example.
>>
>> ________________________________________
>> From: Stuart White [stuart.white1@gmail.com]
>> Sent: Saturday, February 28, 2009 6:06 AM
>> To: core-user@hadoop.apache.org
>> Subject: MapReduce jobs with expensive initialization
>>
>> I have a mapreduce job that requires expensive initialization (loading
>> of some large dictionaries before processing).
>>
>> I want to avoid executing this initialization more than necessary.
>>
>> I understand that I need to call setNumTasksToExecutePerJvm to -1 to
>> force mapreduce to reuse JVMs when executing tasks.
>>
>> How I've been performing my initialization is, in my mapper, I
>> override MapReduceBase#configure, read my parms from the JobConf, and
>> load my dictionaries.
>>
>> It appears, from the tests I've run, that even though
>> NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
>> are being created for each task, and therefore I'm still re-running
>> this expensive initialization for each task.
>>
>> So, my question is: how can I avoid re-executing this expensive
>> initialization per-task?  Should I move my initialization code out of
>> my mapper class and into my "main" class?  If so, how do I pass
>> references to the loaded dictionaries from my main class to my mapper?
>>
>> Thanks!
>>
>

Re: MapReduce jobs with expensive initialization

Posted by jason hadoop <ja...@gmail.com>.

If you have to, you can reach through all of the class loaders and find the
instance of your singleton class that has the data loaded. It is awkward, and
I haven't done this in Java since the late 90's. It did work the last time I
did it.
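
The first step of such a search, listing the chain of class loaders reachable from a starting loader, might be sketched like this. LoaderChain and chainFrom are hypothetical names, and the sketch only follows getParent(); it does not discover sibling loaders or locate any instance, so it is a starting point and not the full technique described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that walks a class loader's parent chain.
class LoaderChain {
    // Returns the given loader followed by each of its ancestors.
    // The bootstrap loader is represented by null and ends the walk.
    static List<ClassLoader> chainFrom(ClassLoader start) {
        List<ClassLoader> chain = new ArrayList<>();
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            chain.add(cl);
        }
        return chain;
    }

    public static void main(String[] args) {
        // Start from the loader the current thread uses to find classes.
        ClassLoader start = Thread.currentThread().getContextClassLoader();
        for (ClassLoader cl : chainFrom(start)) {
            System.out.println(cl);
        }
    }
}
```

Running this inside a task would at least show whether each task sees a different loader at the head of the chain.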


On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey <sc...@richrelevance.com> wrote:

> You could create a singleton class and reference the dictionary stuff in
> that.  You would probably want this separate from other classes as to
> control exactly what data is held on to for a long time and what is not.
>
> class Singleton {
>
>     private static final Singleton _instance = new Singleton();
>
>     private Singleton() {
>         // ... initialize here; only ever called once per classloader or JVM
>     }
>
>     public static Singleton getSingleton() {
>         return _instance;
>     }
> }
>
> in mapper:
>
> Singleton dictionary = Singleton.getSingleton();
>
> This assumes that each mapper doesn't live in its own classloader space
> (which would make even static singletons not shareable), and has the
> drawback that once initialized, that memory associated with the singleton
> won't go away until the JVM or classloader that hosts it dies.
>
> I have not tried this myself, and do not know the exact classloader
> semantics used in the new 'persistent' task JVMs.  They could have a
> classloader per job, and dispose of those when the job is complete -- though
> then it is impossible to persist data across jobs but only within them.  Or
> there could be one permanent persisted classloader, or one per task.   All
> will behave differently with respect to statics like the above example.
>
> ________________________________________
> From: Stuart White [stuart.white1@gmail.com]
> Sent: Saturday, February 28, 2009 6:06 AM
> To: core-user@hadoop.apache.org
> Subject: MapReduce jobs with expensive initialization
>
> I have a mapreduce job that requires expensive initialization (loading
> of some large dictionaries before processing).
>
> I want to avoid executing this initialization more than necessary.
>
> I understand that I need to call setNumTasksToExecutePerJvm to -1 to
> force mapreduce to reuse JVMs when executing tasks.
>
> How I've been performing my initialization is, in my mapper, I
> override MapReduceBase#configure, read my parms from the JobConf, and
> load my dictionaries.
>
> It appears, from the tests I've run, that even though
> NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
> are being created for each task, and therefore I'm still re-running
> this expensive initialization for each task.
>
> So, my question is: how can I avoid re-executing this expensive
> initialization per-task?  Should I move my initialization code out of
> my mapper class and into my "main" class?  If so, how do I pass
> references to the loaded dictionaries from my main class to my mapper?
>
> Thanks!
>