Posted to dev@tuscany.apache.org by Chris Trezzo <ct...@gmail.com> on 2008/03/30 06:06:21 UTC

GSoc08 Application: Hadoop Map/Reduce SCA Integration Project

Hello everyone,

I have posted a rough draft proposal for the project entitled  
"Simplify the development of Map/Reduce applications and their  
integration with various sources of information."

The draft is located here: http://www.cse.ucsd.edu/~ctrezzo/gsocapplication.html

Any comments/suggestions would be highly appreciated.

Thank you for your time.

Chris Trezzo
ctrezzo@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: tuscany-dev-unsubscribe@ws.apache.org
For additional commands, e-mail: tuscany-dev-help@ws.apache.org


Re: GSoc08 Application: Hadoop Map/Reduce SCA Integration Project

Posted by Giorgio Zoppi <gi...@gmail.com>.
You could see map and reduce as skeletons:
http://homepages.inf.ed.ac.uk/mic/Skeletons/

Or use map+reduce with Orc:
http://www.cs.utexas.edu/~wcook/projects/orc/index.htm
I'm working on the same stuff at the moment, but with ProActive (a
data-parallel skeleton framework), so if you have problems, just ask.
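
For a concrete picture, the two skeletons can be sketched in a few lines of plain Java; the Stream API stands in for a real skeleton library here, and the word-length example is invented for illustration:

```java
import java.util.List;

public class MapReduceSkeleton {

    // map skeleton: apply a function independently to every element;
    // reduce skeleton: fold the mapped values into a single result.
    static int totalLength(List<String> words) {
        return words.stream()
                    .map(String::length)      // map phase
                    .reduce(0, Integer::sum); // reduce phase
    }

    public static void main(String[] args) {
        System.out.println(totalLength(List.of("map", "reduce"))); // prints 9
    }
}
```

A skeleton framework adds distribution and scheduling behind the same shape: the map phase parallelizes freely, and the reduce phase needs an associative combining function.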

Just 1c,
Ciao,
Giorgio.
---
Giorgio Zoppi             giorgio.zoppi@gmail.com

You're not your job. You're not how much money you have in the bank.
You're not the car you drive. You're not the contents of your wallet.
You're not your fucking khakis. You're the all-singing, all-dancing
crap of the world.  - Tyler Durden (Fight Club)



Re: GSoc08 Application: Hadoop Map/Reduce SCA Integration Project

Posted by Chris Trezzo <ct...@gmail.com>.
Hey Jean-Sebastien,

Thanks a lot for the input! I will update my proposal and submit it
to the GSoC web app.

I am about to start the drive from the Bay Area down to San Diego, so
I will not have access to the internet again until late tonight.

The management layer sounds like a great idea. Definitely something I  
would like to work on in the future. I will investigate both approaches.

I will also check out skeletons and Orc.

Thanks again,

Chris


On Mar 30, 2008, at 4:12 PM, Jean-Sebastien Delfino wrote:

> Chris Trezzo wrote:
>> This is a great idea. I think, if I understand you correctly,  
>> adding a fourth implementation type is intended to address this.  
>> The extra type is going to act something like an orchestrator,  
>> trying to intelligently manage the Map, Combine, and Reduce  
>> functions over distributed computing facilities and heterogeneous  
>> data sources. Like you said, this component could shape the  
>> deployment of computations based on things like cloud load, time of  
>> day, locality of resources and so on.
>> I should probably make this more clear in my proposal.
>> Thank you for the comment/suggestion!
>> Chris
>> On Mar 30, 2008, at 12:50 AM, Robert Burrell Donkin wrote:
>>> On Sun, Mar 30, 2008 at 5:06 AM, Chris Trezzo <ct...@gmail.com>  
>>> wrote:
>>>> Hello everyone,
>>>>
>>>> I have posted a rough draft proposal for the project entitled
>>>> "Simplify the development of Map/Reduce applications and their
>>>> integration with various sources of information."
>>>>
>>>> The draft is located here: http://www.cse.ucsd.edu/~ctrezzo/gsocapplication.html
>>>>
>>>> Any comments/suggestions would be highly appreciated.
>>>
>>> just throwing out an idea...
>>>
>>> but would it be possible/beneficial to wrap Map/Reduce resources  
>>> using
>>> SCAs the other way round as well?
>>>
>>> for example, take an abstract service which performs some possibly
>>> intensive analytic computation. at smaller scales or development,  
>>> the
>>> analytic components might be assembled into a simple web service
>>> running on a single container. at the largest scales, the work may
>>> need to be farmed out to one of a number of clouds.
>>>
>>> perhaps an active management layer might be able to make decisions  
>>> to
>>> route the processing to different, possibly heterogeneous, resources
>>> based on data and meta-data (cloud load, time of day and so on). for
>>> example, during local night these computations might be directed  
>>> to a
>>> grid formed on general purpose PCs used during the day.
>>>
>>> (who usually just lurks...)
>>>
>>> - robert
>>>
>
> Looks pretty good to me!
>
> One comment: I think it would be good to introduce concrete use  
> cases / scenarios to help drive the development of the project, and  
> present them in a sentence or two in the proposal.
>
> You could start with some of the existing Hadoop examples implemented
> as SCA components, then a slightly bigger application showing the  
> benefits of reusing and wiring components - as counting words in a  
> big document is a little simplistic :) - and the integration of  
> external data sources, or invocation of SCA services with the output  
> of the map/reduce for example.
>
> For item (3) you could start by looking at the SCA interface types.  
> SCA components can use local interfaces or remote interfaces on  
> their services and references. Remote interfaces can cross a network
> boundary; local interfaces require the components to run in the
> same JVM, classloader, etc. You could start with that and use that
> info to control where components run on the Hadoop cloud: components
> with local interfaces would be packaged together, while components with
> remote interfaces could run on different nodes.
>
> Then once you have that running you could explore SCA policies,  
> other requirements of your components etc.
>
> I like Robert's idea too, I can imagine a management layer that  
> analyzes the SCA metadata, the shape and usage of the cloud and uses  
> some rules to decide the allocation of components and jobs. Sounds  
> really cool!
>
> Hope this helps.
> -- 
> Jean-Sebastien
>




Re: GSoc08 Application: Hadoop Map/Reduce SCA Integration Project

Posted by Jean-Sebastien Delfino <js...@apache.org>.
Chris Trezzo wrote:
> This is a great idea. I think, if I understand you correctly, adding a 
> fourth implementation type is intended to address this. The extra type 
> is going to act something like an orchestrator, trying to intelligently 
> manage the Map, Combine, and Reduce functions over distributed computing 
> facilities and heterogeneous data sources. Like you said, this component 
> could shape the deployment of computations based on things like cloud 
> load, time of day, locality of resources and so on.
> 
> I should probably make this more clear in my proposal.
> 
> Thank you for the comment/suggestion!
> 
> Chris
> 
> 
> On Mar 30, 2008, at 12:50 AM, Robert Burrell Donkin wrote:
> 
>> On Sun, Mar 30, 2008 at 5:06 AM, Chris Trezzo <ct...@gmail.com> wrote:
>>> Hello everyone,
>>>
>>> I have posted a rough draft proposal for the project entitled
>>> "Simplify the development of Map/Reduce applications and their
>>> integration with various sources of information."
>>>
>>> The draft is located here: 
>>> http://www.cse.ucsd.edu/~ctrezzo/gsocapplication.html
>>>
>>> Any comments/suggestions would be highly appreciated.
>>
>> just throwing out an idea...
>>
>> but would it be possible/beneficial to wrap Map/Reduce resources using
>> SCAs the other way round as well?
>>
>> for example, take an abstract service which performs some possibly
>> intensive analytic computation. at smaller scales or development, the
>> analytic components might be assembled into a simple web service
>> running on a single container. at the largest scales, the work may
>> need to be farmed out to one of a number of clouds.
>>
>> perhaps an active management layer might be able to make decisions to
>> route the processing to different, possibly heterogeneous, resources
>> based on data and meta-data (cloud load, time of day and so on). for
>> example, during local night these computations might be directed to a
>> grid formed on general purpose PCs used during the day.
>>
>> (who usually just lurks...)
>>
>> - robert
>>

Looks pretty good to me!

One comment: I think it would be good to introduce concrete use cases / 
scenarios to help drive the development of the project, and present them 
in a sentence or two in the proposal.

You could start with some of the existing Hadoop examples implemented as 
SCA components, then a slightly bigger application showing the benefits 
of reusing and wiring components - as counting words in a big document 
is a little simplistic :) - and the integration of external data 
sources, or invocation of SCA services with the output of the map/reduce 
for example.
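
As a baseline for such scenarios, the word-count computation itself can be sketched in plain Java, with no Hadoop or Tuscany dependency; the class and method names below are invented for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Map phase: split each input line into (word, 1) pairs.
    // Reduce phase: sum the counts per word. A combiner would do the
    // same local summing before the shuffle to cut network traffic.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // reduce step
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")).get("to")); // prints 2
    }
}
```

Wrapping the map and reduce steps as separate SCA components would then let the composite rewire them, e.g. to feed the output map into another SCA service.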

For item (3) you could start by looking at the SCA interface types. SCA 
components can use local interfaces or remote interfaces on their 
services and references. Remote interfaces can cross a network
boundary; local interfaces require the components to run in the
same JVM, classloader, etc. You could start with that and use that info
to control where components run on the Hadoop cloud: components with
local interfaces would be packaged together, while components with remote
interfaces could run on different nodes.
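
As an illustration of that packaging decision, a composite might wire a mapper to a reducer like this (a sketch against the SCA 1.0 assembly XML; the component and implementation class names are hypothetical, not existing Tuscany or Hadoop artifacts):

```xml
<!-- Hypothetical sketch; component and class names are invented. -->
<composite xmlns="http://www.osoa.org/xmlns/sca/1.0"
           targetNamespace="http://sample/mapreduce"
           name="WordCountComposite">

  <component name="MapperComponent">
    <implementation.java class="sample.MapperImpl"/>
    <!-- If the reducer's interface is remotable, this wire may cross
         nodes; if it is local, both components must be packaged together. -->
    <reference name="reducer" target="ReducerComponent"/>
  </component>

  <component name="ReducerComponent">
    <implementation.java class="sample.ReducerImpl"/>
  </component>
</composite>
```

In SCA Java, marking the service interface with @Remotable is what allows a wire to cross a network boundary.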

Then once you have that running you could explore SCA policies, other 
requirements of your components etc.

I like Robert's idea too, I can imagine a management layer that analyzes 
the SCA metadata, the shape and usage of the cloud and uses some rules 
to decide the allocation of components and jobs. Sounds really cool!

Hope this helps.
-- 
Jean-Sebastien



Re: GSoc08 Application: Hadoop Map/Reduce SCA Integration Project

Posted by Chris Trezzo <ct...@gmail.com>.
This is a great idea. I think, if I understand you correctly, adding a  
fourth implementation type is intended to address this. The extra type  
is going to act something like an orchestrator, trying to  
intelligently manage the Map, Combine, and Reduce functions over  
distributed computing facilities and heterogeneous data sources. Like  
you said, this component could shape the deployment of computations  
based on things like cloud load, time of day, locality of resources  
and so on.

I should probably make this more clear in my proposal.

Thank you for the comment/suggestion!

Chris


On Mar 30, 2008, at 12:50 AM, Robert Burrell Donkin wrote:

> On Sun, Mar 30, 2008 at 5:06 AM, Chris Trezzo <ct...@gmail.com>  
> wrote:
>> Hello everyone,
>>
>> I have posted a rough draft proposal for the project entitled
>> "Simplify the development of Map/Reduce applications and their
>> integration with various sources of information."
>>
>> The draft is located here: http://www.cse.ucsd.edu/~ctrezzo/gsocapplication.html
>>
>> Any comments/suggestions would be highly appreciated.
>
> just throwing out an idea...
>
> but would it be possible/beneficial to wrap Map/Reduce resources using
> SCAs the other way round as well?
>
> for example, take an abstract service which performs some possibly
> intensive analytic computation. at smaller scales or development, the
> analytic components might be assembled into a simple web service
> running on a single container. at the largest scales, the work may
> need to be farmed out to one of a number of clouds.
>
> perhaps an active management layer might be able to make decisions to
> route the processing to different, possibly heterogeneous, resources
> based on data and meta-data (cloud load, time of day and so on). for
> example, during local night these computations might be directed to a
> grid formed on general purpose PCs used during the day.
>
> (who usually just lurks...)
>
> - robert
>




Re: GSoc08 Application: Hadoop Map/Reduce SCA Integration Project

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Sun, Mar 30, 2008 at 5:06 AM, Chris Trezzo <ct...@gmail.com> wrote:
> Hello everyone,
>
>  I have posted a rough draft proposal for the project entitled
>  "Simplify the development of Map/Reduce applications and their
>  integration with various sources of information."
>
>  The draft is located here: http://www.cse.ucsd.edu/~ctrezzo/gsocapplication.html
>
>  Any comments/suggestions would be highly appreciated.

just throwing out an idea...

but would it be possible/beneficial to wrap Map/Reduce resources using
SCAs the other way round as well?

for example, take an abstract service which performs some possibly
intensive analytic computation. at smaller scales or development, the
analytic components might be assembled into a simple web service
running on a single container. at the largest scales, the work may
need to be farmed out to one of a number of clouds.

perhaps an active management layer might be able to make decisions to
route the processing to different, possibly heterogeneous, resources
based on data and meta-data (cloud load, time of day and so on). for
example, during local night these computations might be directed to a
grid formed on general purpose PCs used during the day.
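
That management-layer rule set could be prototyped as a tiny decision function; the thresholds and target names below are made up purely for illustration:

```java
public class JobRouter {

    // Route a map/reduce job from simple metadata rules: off-peak hours
    // go to the general-purpose PC grid, a lightly loaded cloud takes
    // daytime work, and everything else falls back to a single container.
    // All thresholds here are invented.
    static String route(int hourOfDay, double cloudLoad) {
        boolean localNight = hourOfDay >= 22 || hourOfDay < 6;
        if (localNight) {
            return "pc-grid";
        }
        if (cloudLoad < 0.8) {
            return "cloud";
        }
        return "local-container";
    }

    public static void main(String[] args) {
        System.out.println(route(23, 0.5)); // prints pc-grid
    }
}
```

A real management layer would feed such rules from live metadata (cloud load, data locality, SCA policy), but the decision shape stays the same.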

(who usually just lurks...)

- robert
