Posted to common-user@hadoop.apache.org by Igor Nikolic <i....@tudelft.nl> on 2008/06/25 13:33:48 UTC

Is Hadoop the thing for us ?

Hello list

We will be getting access to a cluster soon, and I was wondering whether
I should use Hadoop, or whether I am better off with the usual batch
schedulers such as ProActive. I am not a CS/CE person, and from reading
the website I cannot get a sense of whether Hadoop is for me.

A little background:
We have a relatively large agent-based simulation (a 20+ MB jar) that
needs to be swept across very large parameter spaces. Agents communicate
only within the simulation, so there is no interprocess communication.
The parameter vector is at most 20 elements long, a single run may take
5-10 minutes on a normal desktop, and it might return a few MB of raw
data. We need 10k-100K runs, more if possible.



Thanks for advice, even a short yes/no is welcome

Greetings
Igor

-- 
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: i.nikolic@tudelft.nl
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl


Re: Is Hadoop the thing for us ?

Posted by Ted Dunning <te...@gmail.com>.
This can work pretty well if you just use the list of parameter settings as
input.  The map task would run your simulation and output the data.  You may
not even need a reducer, although a parallelized summary of the output might
be very nice to have.  Because each of your sims takes a long time to run,
Hadoop should be very efficient.

The only change you should need to make is to write a map class that
launches your simulation and copies whatever output you want into HDFS
instead of the local file system.  If you can get your sim to write to HDFS
directly, that would be better.
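As a rough sketch only: assuming each input line is one parameter vector,
and assuming a hypothetical Simulation.run() wrapper around your existing
jar, the map class could look something like this (classic mapred API):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimulationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text paramLine,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // One input line = one parameter vector, e.g. "0.3 17 42 ..." (up to 20 values).
    String[] params = paramLine.toString().trim().split("\\s+");

    // Hypothetical hook into your existing simulation jar; here it is assumed
    // to return the raw result data (or a path to a result file it wrote).
    String result = Simulation.run(params);

    // Key the output by the parameter vector so every result stays traceable.
    output.collect(paramLine, new Text(result));
  }
}

One practical note: since a single run can take 5-10 minutes, have the
simulation report progress now and then (reporter.progress()) or raise the
task timeout, so the framework does not assume the map task is dead and
reschedule it.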

On Wed, Jun 25, 2008 at 4:33 AM, Igor Nikolic <i....@tudelft.nl> wrote:

> Hello list
>
> We will be getting access to a cluster soon, and I was wondering whether
> this I should use Hadoop ?  Or am I better of with the usual batch
> schedulers such as ProActive etc ? I am not a CS/CE person, and from reading
> the website I can not get a sense of whether hadoop is for me.
>
> A little background:
> We have a  relatively large agent based simulation ( 20+ MB jar) that needs
> to be swept across very large parameter spaces. Agents communicate only
> within the simulation, so there is no interprocess communication. The
> parameter vector is max 20 long , the simulation may take 5-10 minutes on a
> normal desktop and it might return a few mb of raw data. We need 10k-100K
> runs, more if possible.
>
>
>
> Thanks for advice, even a short yes/no is welcome
>
> Greetings
> Igor
>
> --
> ir. Igor Nikolic
> PhD Researcher
> Section Energy & Industry
> Faculty of Technology, Policy and Management
> Delft University of Technology, The Netherlands
>
> Tel: +31152781135
> Email: i.nikolic@tudelft.nl
> Web: http://www.igornikolic.com
> wiki server: http://wiki.tudelft.nl
>
>


-- 
ted

Re: Is Hadoop the thing for us ?

Posted by Deyaa Adranale <de...@iais.fraunhofer.de>.
Here is an informal description of the map/reduce model:

In the map/reduce paradigm the input data usually consists of a (very
large) number of records. The paradigm assumes that you want to do some
computation on each input record separately (without simultaneous access
to other records) to produce some result (the map function). The results
from all the records are then grouped (based on a key), and each group of
results can be processed further (the reduce function) to produce a final
result for each group.
Also, global parameters can be made visible to the map function.
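For example, a reduce step that averages one numeric metric per group might
look roughly like this (only a sketch against the classic mapred API; it
assumes the map step emits a group key, say a scenario id taken from the
parameter vector, together with a numeric value):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageReducer extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  public void reduce(Text groupKey, Iterator<DoubleWritable> values,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    // All values that share a key arrive together; here we just average them.
    double sum = 0.0;
    long count = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      count++;
    }
    if (count > 0) {
      output.collect(groupKey, new DoubleWritable(sum / count));
    }
  }
}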

So you have to try to express your problem in this model; if that is
possible, you can rewrite your program or use Hadoop's native libraries.

regards,

Deyaa


Igor Nikolic wrote:
> Thank you for your comment, it did confirm my suspicions.
>
> You framed the problem correctly. I will probably invest a bit of time 
> studying the framework anyway, to see if a rewrite is interesting, 
> since we hit scaling limitations on our Agent scheduler framework. Our 
> main computational load is the massive amount of agent reasoning ( 
> think JbossRules) and  inter-agent communication ( they need to sell 
> and buy stuff to each other)  so I am not sure if it is at all 
> possible to break it down to small tasks, specially if this needs to 
> happen across CPU's, the latency is going to kill us.
>
> Thanks
> igor
>
> John Martyniak wrote:
>> I am new to Hadoop.  So take this information with a grain of salt.
>> But the power of Hadoop is breaking down big problems into small 
>> pieces and
>> spreading it across many (thousands) of machines, in effect creating a
>> massively parallel processing engine.
>>
>> But in order to take advantage of that functionality you must write your
>> application to take advantage of it, using the Hadoop frameworks.
>>
>> So if I understand  your dilemma correctly.  I do not think that 
>> Hadoop is
>> for you, unless you want to re-write your app to take advantage of 
>> it.  And
>> I suspect that if you have access to a traditional cluster, that will 
>> be a
>> better alternative for you.
>>
>> Hope that this helps some.
>>
>> -John
>>
>>
>> On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic <i....@tudelft.nl> 
>> wrote:
>>
>>  
>>> Hello list
>>>
>>> We will be getting access to a cluster soon, and I was wondering 
>>> whether
>>> this I should use Hadoop ?  Or am I better of with the usual batch
>>> schedulers such as ProActive etc ? I am not a CS/CE person, and from 
>>> reading
>>> the website I can not get a sense of whether hadoop is for me.
>>>
>>> A little background:
>>> We have a  relatively large agent based simulation ( 20+ MB jar) 
>>> that needs
>>> to be swept across very large parameter spaces. Agents communicate only
>>> within the simulation, so there is no interprocess communication. The
>>> parameter vector is max 20 long , the simulation may take 5-10 
>>> minutes on a
>>> normal desktop and it might return a few mb of raw data. We need 
>>> 10k-100K
>>> runs, more if possible.
>>>
>>>
>>>
>>> Thanks for advice, even a short yes/no is welcome
>>>
>>> Greetings
>>> Igor
>>>
>>> -- 
>>> ir. Igor Nikolic
>>> PhD Researcher
>>> Section Energy & Industry
>>> Faculty of Technology, Policy and Management
>>> Delft University of Technology, The Netherlands
>>>
>>> Tel: +31152781135
>>> Email: i.nikolic@tudelft.nl
>>> Web: http://www.igornikolic.com
>>> wiki server: http://wiki.tudelft.nl
>>>
>>>
>>>     
>>
>>
>>   
>
>

Re: Is Hadoop the thing for us ?

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I do not fully understand the job you are running, but if each simulation
can run independently of the others, then you could run a MapReduce job
that spreads the simulations over many servers, so each one runs one or
more of them at the same time. That gives you a level of protection
against servers going down, and it takes care of spreading the work out
across the servers. It should also be able to handle more than the 100K
simulations you said you would like to run. You would just need to write
the input code that splits the simulations into splits the MR framework
can work with.
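For instance, if your Hadoop version has NLineInputFormat, a driver along
these lines would give every parameter vector its own map task (only a
sketch; the class and file names are placeholders, and SimulationMapper is
the hypothetical map class mentioned earlier in the thread):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SweepDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SweepDriver.class);
    conf.setJobName("parameter-sweep");

    // Each line of params.txt is one parameter vector; one line per map task
    // means every simulation run is scheduled (and re-run on failure) on its own.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    conf.setMapperClass(SimulationMapper.class);
    conf.setNumReduceTasks(0);               // map-only: just collect raw output
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path("params.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("sweep-output"));

    JobClient.runJob(conf);
  }
}

With zero reduce tasks the map output lands directly in sweep-output/ on
HDFS; a summary reducer can be added later if aggregated results are wanted.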

Billy

"Igor Nikolic" <i....@tudelft.nl> wrote in 
message news:4862463C.2070300@tudelft.nl...
> Thank you for your comment, it did confirm my suspicions.
>
> You framed the problem correctly. I will probably invest a bit of time 
> studying the framework anyway, to see if a rewrite is interesting, since 
> we hit scaling limitations on our Agent scheduler framework. Our main 
> computational load is the massive amount of agent reasoning ( think 
> JbossRules) and  inter-agent communication ( they need to sell and buy 
> stuff to each other)  so I am not sure if it is at all possible to break 
> it down to small tasks, specially if this needs to happen across CPU's, 
> the latency is going to kill us.
>
> Thanks
> igor
>
> John Martyniak wrote:
>> I am new to Hadoop.  So take this information with a grain of salt.
>> But the power of Hadoop is breaking down big problems into small pieces 
>> and
>> spreading it across many (thousands) of machines, in effect creating a
>> massively parallel processing engine.
>>
>> But in order to take advantage of that functionality you must write your
>> application to take advantage of it, using the Hadoop frameworks.
>>
>> So if I understand  your dilemma correctly.  I do not think that Hadoop 
>> is
>> for you, unless you want to re-write your app to take advantage of it. 
>> And
>> I suspect that if you have access to a traditional cluster, that will be 
>> a
>> better alternative for you.
>>
>> Hope that this helps some.
>>
>> -John
>>
>>
>> On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic 
>> <i....@tudelft.nl> wrote:
>>
>>
>>> Hello list
>>>
>>> We will be getting access to a cluster soon, and I was wondering whether
>>> this I should use Hadoop ?  Or am I better of with the usual batch
>>> schedulers such as ProActive etc ? I am not a CS/CE person, and from 
>>> reading
>>> the website I can not get a sense of whether hadoop is for me.
>>>
>>> A little background:
>>> We have a  relatively large agent based simulation ( 20+ MB jar) that 
>>> needs
>>> to be swept across very large parameter spaces. Agents communicate only
>>> within the simulation, so there is no interprocess communication. The
>>> parameter vector is max 20 long , the simulation may take 5-10 minutes 
>>> on a
>>> normal desktop and it might return a few mb of raw data. We need 
>>> 10k-100K
>>> runs, more if possible.
>>>
>>>
>>>
>>> Thanks for advice, even a short yes/no is welcome
>>>
>>> Greetings
>>> Igor
>>>
>>> --
>>> ir. Igor Nikolic
>>> PhD Researcher
>>> Section Energy & Industry
>>> Faculty of Technology, Policy and Management
>>> Delft University of Technology, The Netherlands
>>>
>>> Tel: +31152781135
>>> Email: i.nikolic@tudelft.nl
>>> Web: http://www.igornikolic.com
>>> wiki server: http://wiki.tudelft.nl
>>>
>>>
>>>
>>
>>
>>
>
>
> -- 
> ir. Igor Nikolic
> PhD Researcher
> Section Energy & Industry
> Faculty of Technology, Policy and Management
> Delft University of Technology, The Netherlands
>
> Tel: +31152781135
> Email: i.nikolic@tudelft.nl
> Web: http://www.igornikolic.com
> wiki server: http://wiki.tudelft.nl
>
> 



Re: Is Hadoop the thing for us ?

Posted by Igor Nikolic <i....@tudelft.nl>.
Thank you for your comment; it confirmed my suspicions.

You framed the problem correctly. I will probably invest a bit of time
studying the framework anyway, to see whether a rewrite is worthwhile,
since we are hitting scaling limitations in our agent scheduler framework.
Our main computational load is the massive amount of agent reasoning
(think JBoss Rules) and inter-agent communication (the agents need to buy
and sell things from each other), so I am not sure it is possible to break
it down into small tasks at all, especially if this needs to happen across
CPUs; the latency is going to kill us.

Thanks
igor

John Martyniak wrote:
> I am new to Hadoop.  So take this information with a grain of salt.
> But the power of Hadoop is breaking down big problems into small pieces and
> spreading it across many (thousands) of machines, in effect creating a
> massively parallel processing engine.
>
> But in order to take advantage of that functionality you must write your
> application to take advantage of it, using the Hadoop frameworks.
>
> So if I understand  your dilemma correctly.  I do not think that Hadoop is
> for you, unless you want to re-write your app to take advantage of it.  And
> I suspect that if you have access to a traditional cluster, that will be a
> better alternative for you.
>
> Hope that this helps some.
>
> -John
>
>
> On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic <i....@tudelft.nl> wrote:
>
>   
>> Hello list
>>
>> We will be getting access to a cluster soon, and I was wondering whether
>> this I should use Hadoop ?  Or am I better of with the usual batch
>> schedulers such as ProActive etc ? I am not a CS/CE person, and from reading
>> the website I can not get a sense of whether hadoop is for me.
>>
>> A little background:
>> We have a  relatively large agent based simulation ( 20+ MB jar) that needs
>> to be swept across very large parameter spaces. Agents communicate only
>> within the simulation, so there is no interprocess communication. The
>> parameter vector is max 20 long , the simulation may take 5-10 minutes on a
>> normal desktop and it might return a few mb of raw data. We need 10k-100K
>> runs, more if possible.
>>
>>
>>
>> Thanks for advice, even a short yes/no is welcome
>>
>> Greetings
>> Igor
>>
>> --
>> ir. Igor Nikolic
>> PhD Researcher
>> Section Energy & Industry
>> Faculty of Technology, Policy and Management
>> Delft University of Technology, The Netherlands
>>
>> Tel: +31152781135
>> Email: i.nikolic@tudelft.nl
>> Web: http://www.igornikolic.com
>> wiki server: http://wiki.tudelft.nl
>>
>>
>>     
>
>
>   


-- 
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: i.nikolic@tudelft.nl
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl


Re: Is Hadoop the thing for us ?

Posted by John Martyniak <jo...@beforedawn.com>.
I am new to Hadoop, so take this information with a grain of salt.
The power of Hadoop is breaking big problems down into small pieces and
spreading them across many (thousands of) machines, in effect creating a
massively parallel processing engine.

But in order to take advantage of that functionality you must write your
application for it, using the Hadoop frameworks.

So, if I understand your dilemma correctly, I do not think that Hadoop is
for you unless you want to rewrite your app to take advantage of it. And
I suspect that if you have access to a traditional cluster, that will be
a better alternative for you.

Hope that this helps some.

-John


On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic <i....@tudelft.nl> wrote:

> Hello list
>
> We will be getting access to a cluster soon, and I was wondering whether
> this I should use Hadoop ?  Or am I better of with the usual batch
> schedulers such as ProActive etc ? I am not a CS/CE person, and from reading
> the website I can not get a sense of whether hadoop is for me.
>
> A little background:
> We have a  relatively large agent based simulation ( 20+ MB jar) that needs
> to be swept across very large parameter spaces. Agents communicate only
> within the simulation, so there is no interprocess communication. The
> parameter vector is max 20 long , the simulation may take 5-10 minutes on a
> normal desktop and it might return a few mb of raw data. We need 10k-100K
> runs, more if possible.
>
>
>
> Thanks for advice, even a short yes/no is welcome
>
> Greetings
> Igor
>
> --
> ir. Igor Nikolic
> PhD Researcher
> Section Energy & Industry
> Faculty of Technology, Policy and Management
> Delft University of Technology, The Netherlands
>
> Tel: +31152781135
> Email: i.nikolic@tudelft.nl
> Web: http://www.igornikolic.com
> wiki server: http://wiki.tudelft.nl
>
>


-- 
John Martyniak
Before Dawn Solutions, Inc.
9457 S. University Blvd. #266
Highlands Ranch, CO 80126
o: 1-877-499-1562 x707 (Toll Free)
c: 303-522-1756
e: john@beforedawn.com
w: http://www.beforedawn.com