Posted to user@hadoop.apache.org by Richard Whitehead <ri...@ieee.org> on 2016/07/18 16:17:20 UTC

Building a distributed system

Hello, 

I wonder if the community can help me get started.

I’m trying to design the architecture of a project and I think that using some Apache Hadoop technologies may make sense, but I am completely new to distributed systems and to Apache (I am a very experienced developer, but my expertise is image processing on Windows!).

The task is very simple: call 3 or 4 executables in sequence to process some data.  The data is just a simple image and the processing takes tens of minutes.

We are considering a distributed architecture to increase throughput (latency does not matter).  So we need a way to queue work on remote computers, and a way to move the data around.  The architecture will have to work on a single server, or on a couple of servers in a rack, or in the cloud; 2 or 3 computers maximum.

Being new to all this I would prefer something simple rather than something super-powerful.

I was considering Hadoop YARN and HDFS; does this make sense?  I’m assuming MapReduce would be over the top; is that the case?

Thanks in advance.

Richard

Re: Building a distributed system

Posted by Marcin Tustin <mt...@handybook.com>.
Perhaps there is. Note that there are a bunch of Java job queues. HDFS
sounds like it might be a nice way to share the data. YARN or Mesos might
be a nice way to schedule the running of the jobs, but it sounds like you
could use any orchestration system to run the worker processes and just
have them talk to the job queue.

This honestly doesn't sound like a problem where I would reach for Hadoop
execution technologies first.

Also, if you want to be in JVM land, a lot of projects use Scala, which
because of its type system might be even easier to validate. That would
also make using Spark a lot nicer, if you really do decide you want
Hadoop-style execution.

In addition, Spark could potentially take the place of a job queue,
running on top of YARN, Mesos, or its own cluster manager.
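
To make that last idea concrete, here is a minimal sketch (the executable
path and input list are made up) of using Spark purely to fan out runs of
an existing executable, one task per image:

# spark_fanout.py -- hypothetical sketch; submit with spark-submit.
import subprocess
from pyspark import SparkConf, SparkContext

def run_pipeline(image_path):
    # Each Spark task just shells out to the existing processing executable,
    # which is assumed to be installed on every worker node.
    subprocess.check_call(["/opt/imaging/process_image", image_path])

if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName("image-pipeline"))
    images = ["/data/scan_001.img", "/data/scan_002.img", "/data/scan_003.img"]
    # One partition per image so each executable run becomes its own task.
    sc.parallelize(images, numSlices=len(images)).foreach(run_pipeline)
    sc.stop()

Pointed at a YARN, Mesos, or standalone master via spark-submit, the same
script would run unchanged on one machine or on a 2-3 node cluster.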

On Tue, Jul 19, 2016 at 4:48 AM, Richard Whitehead <
richard.whitehead@ieee.org> wrote:

> Thanks Ravi and Marcin,
>
> You are right, what we need is a work queue, a way to start jobs on remote
> machines, and a way to move data to and from those remote machines.   The
> “jobs” are just executables that process one item of data.  We don’t need
> to split the data into chunks or to combine the results from several jobs.
>
> The feeling amongst the developers seems to be that Java would be
> preferable to Python (this is a medical product and people, whether rightly
> or wrongly, think Java would be easier to validate).
>
> Is there a way to use the Hadoop (or some other) infrastructure in a
> simple way to prevent us having to write a scheduler, database schema
> etc.?  We can do that but it seems to be solving a problem that has already
> been solved many times.
>
> Thanks again,
>
> Richard
>
>
> *From:* Ravi Prakash <ra...@gmail.com>
> *Sent:* Monday, July 18, 2016 7:45 PM
> *To:* Marcin Tustin <mt...@handybook.com>
> *Cc:* Richard Whitehead <ri...@ieee.org> ;
> user@hadoop.apache.org
> *Subject:* Re: Building a distributed system
>
> Welcome to the community Richard!
>
> I suspect Hadoop can be more useful than just splitting data and stitching
> it back together. Depending on your use cases, it may come in handy to
> manage your machines, restart failed tasks, schedule work when data becomes
> available, etc. I wouldn't necessarily count it out. I'm sorry, I am not
> familiar with Celery, so I can't provide a direct comparison. Also, in the
> not-unlikely event that your input data grows, you wouldn't have to rewrite
> your infrastructure code if you wrote your Hadoop code properly.
>
> HTH
> Ravi
>
> On Mon, Jul 18, 2016 at 9:23 AM, Marcin Tustin <mt...@handybook.com>
> wrote:
>
>> I think you're confused as to what these things are.
>>
>> The fundamental question is whether you want to run one job on sub-parts
>> of the data and then stitch their results together (in which case
>> Hive/MapReduce/Spark will be for you), or whether you essentially already
>> have the splitting into computer-sized chunks figured out and just need a
>> work queue. In the latter case there are a number of alternatives. I happen
>> to like Python, and would recommend Celery (potentially wrapped by
>> something like Airflow) for that case.
>>
>> On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <
>> richard.whitehead@ieee.org> wrote:
>>
>>> Hello,
>>>
>>> I wonder if the community can help me get started.
>>>
>>> I’m trying to design the architecture of a project and I think that
>>> using some Apache Hadoop technologies may make sense, but I am completely
>>> new to distributed systems and to Apache (I am a very experienced
>>> developer, but my expertise is image processing on Windows!).
>>>
>>> The task is very simple: call 3 or 4 executables in sequence to process
>>> some data.  The data is just a simple image and the processing takes tens
>>> of minutes.
>>>
>>> We are considering a distributed architecture to increase throughput
>>> (latency does not matter).  So we need a way to queue work on remote
>>> computers, and a way to move the data around.  The architecture will have
>>> to work on a single server, or on a couple of servers in a rack, or in the
>>> cloud; 2 or 3 computers maximum.
>>>
>>> Being new to all this I would prefer something simple rather than
>>> something super-powerful.
>>>
>>> I was considering Hadoop YARN and HDFS; does this make sense?  I’m
>>> assuming MapReduce would be over the top; is that the case?
>>>
>>> Thanks in advance.
>>>
>>> Richard
>>>
>>
>>
>



Re: Building a distributed system

Posted by Richard Whitehead <ri...@ieee.org>.
Thanks Mirko,

I simply don’t understand enough to get going with this.  The documentation dives into details so quickly, and I don’t really understand the basics of using Hadoop yet.  I want to run a process that takes a file as input and gives a file as output; that seems pretty straightforward, but I can’t tell whether it’s supported.

I think we are going to drop this for now; the barrier to getting started is too high.

Thanks a lot for your help,

Richard


From: Mirko Kämpf 
Sent: Tuesday, July 19, 2016 9:53 AM
To: Richard Whitehead 
Subject: Re: Building a distributed system

Hello Richard, 

in order to process individual "items of data" it is pretty easy to use the Hadoop Streaming API, assuming you have placed the data in HDFS. 
The distribution of the "items" also comes for free as part of the infrastructure.

Have a look at this document:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
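
For concreteness, a rough sketch (all paths and names below are made up) of
what that could look like as a map-only streaming job, where the input is a
text file of HDFS paths, one per image, and the mapper is a thin wrapper
around the existing executable:

#!/usr/bin/env python
# process_one.py -- hypothetical streaming mapper. Submitted roughly as:
#   hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
#     -D mapreduce.job.reduces=0 \
#     -input /images/todo.txt -output /images/log \
#     -mapper process_one.py -file process_one.py
import os
import subprocess
import sys

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    # Clean up leftovers in case this mapper handles more than one record.
    for f in ("in.img", "out.img"):
        if os.path.exists(f):
            os.remove(f)
    # Pull the image out of HDFS, run the existing executable, push the result back.
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, "in.img"])
    subprocess.check_call(["/opt/imaging/process_image", "in.img", "out.img"])
    subprocess.check_call(["hadoop", "fs", "-put", "-f", "out.img", hdfs_path + ".out"])
    print(hdfs_path + "\tdone")

Hadoop then takes care of spreading the records across the machines and
re-running any mapper that fails.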

Best wishes,
Mirko


2016-07-19 10:48 GMT+02:00 Richard Whitehead <ri...@ieee.org>:

  Thanks Ravi and Marcin,

  You are right, what we need is a work queue, a way to start jobs on remote machines, and a way to move data to and from those remote machines.   The “jobs” are just executables that process one item of data.  We don’t need to split the data into chunks or to combine the results from several jobs.

  The feeling amongst the developers seems to be that Java would be preferable to Python (this is a medical product and people, whether rightly or wrongly, think Java would be easier to validate).  

  Is there a way to use the Hadoop (or some other) infrastructure in a simple way to prevent us having to write a scheduler, database schema etc.?  We can do that but it seems to be solving a problem that has already been solved many times.

  Thanks again,

  Richard


  From: Ravi Prakash 
  Sent: Monday, July 18, 2016 7:45 PM
  To: Marcin Tustin 
  Cc: Richard Whitehead ; user@hadoop.apache.org 
  Subject: Re: Building a distributed system

  Welcome to the community Richard!


  I suspect Hadoop can be more useful than just splitting data and stitching it back together. Depending on your use cases, it may come in handy to manage your machines, restart failed tasks, schedule work when data becomes available, etc. I wouldn't necessarily count it out. I'm sorry, I am not familiar with Celery, so I can't provide a direct comparison. Also, in the not-unlikely event that your input data grows, you wouldn't have to rewrite your infrastructure code if you wrote your Hadoop code properly.


  HTH

  Ravi


  On Mon, Jul 18, 2016 at 9:23 AM, Marcin Tustin <mt...@handybook.com> wrote:

    I think you're confused as to what these things are.  

    The fundamental question is whether you want to run one job on sub-parts of the data and then stitch their results together (in which case Hive/MapReduce/Spark will be for you), or whether you essentially already have the splitting into computer-sized chunks figured out and just need a work queue. In the latter case there are a number of alternatives. I happen to like Python, and would recommend Celery (potentially wrapped by something like Airflow) for that case. 

    On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <ri...@ieee.org> wrote:

      Hello, 

      I wonder if the community can help me get started.

      I’m trying to design the architecture of a project and I think that using some Apache Hadoop technologies may make sense, but I am completely new to distributed systems and to Apache (I am a very experienced developer, but my expertise is image processing on Windows!).

      The task is very simple: call 3 or 4 executables in sequence to process some data.  The data is just a simple image and the processing takes tens of minutes.

      We are considering a distributed architecture to increase throughput (latency does not matter).  So we need a way to queue work on remote computers, and a way to move the data around.  The architecture will have to work on a single server, or on a couple of servers in a rack, or in the cloud; 2 or 3 computers maximum.

      Being new to all this I would prefer something simple rather than something super-powerful.

      I was considering Hadoop YARN and HDFS; does this make sense?  I’m assuming MapReduce would be over the top; is that the case?

      Thanks in advance.

      Richard







Re: Building a distributed system

Posted by Richard Whitehead <ri...@ieee.org>.
Thanks Ravi and Marcin,

You are right, what we need is a work queue, a way to start jobs on remote machines, and a way to move data to and from those remote machines.   The “jobs” are just executables that process one item of data.  We don’t need to split the data into chunks or to combine the results from several jobs.

The feeling amongst the developers seems to be that Java would be preferable to Python (this is a medical product and people, whether rightly or wrongly, think Java would be easier to validate).  

Is there a way to use the Hadoop (or some other) infrastructure in a simple way to prevent us having to write a scheduler, database schema etc.?  We can do that but it seems to be solving a problem that has already been solved many times.

Thanks again,

Richard


From: Ravi Prakash 
Sent: Monday, July 18, 2016 7:45 PM
To: Marcin Tustin 
Cc: Richard Whitehead ; user@hadoop.apache.org 
Subject: Re: Building a distributed system

Welcome to the community Richard!


I suspect Hadoop can be more useful than just splitting data and stitching it back together. Depending on your use cases, it may come in handy to manage your machines, restart failed tasks, schedule work when data becomes available, etc. I wouldn't necessarily count it out. I'm sorry, I am not familiar with Celery, so I can't provide a direct comparison. Also, in the not-unlikely event that your input data grows, you wouldn't have to rewrite your infrastructure code if you wrote your Hadoop code properly.


HTH

Ravi


On Mon, Jul 18, 2016 at 9:23 AM, Marcin Tustin <mt...@handybook.com> wrote:

  I think you're confused as to what these things are.  

  The fundamental question is whether you want to run one job on sub-parts of the data and then stitch their results together (in which case Hive/MapReduce/Spark will be for you), or whether you essentially already have the splitting into computer-sized chunks figured out and just need a work queue. In the latter case there are a number of alternatives. I happen to like Python, and would recommend Celery (potentially wrapped by something like Airflow) for that case. 

  On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <ri...@ieee.org> wrote:

    Hello, 

    I wonder if the community can help me get started.

    I’m trying to design the architecture of a project and I think that using some Apache Hadoop technologies may make sense, but I am completely new to distributed systems and to Apache (I am a very experienced developer, but my expertise is image processing on Windows!).

    The task is very simple: call 3 or 4 executables in sequence to process some data.  The data is just a simple image and the processing takes tens of minutes.

    We are considering a distributed architecture to increase throughput (latency does not matter).  So we need a way to queue work on remote computers, and a way to move the data around.  The architecture will have to work on a single server, or on a couple of servers in a rack, or in the cloud; 2 or 3 computers maximum.

    Being new to all this I would prefer something simple rather than something super-powerful.

    I was considering Hadoop YARN and HDFS; does this make sense?  I’m assuming MapReduce would be over the top; is that the case?

    Thanks in advance.

    Richard






Re: Building a distributed system

Posted by Ravi Prakash <ra...@gmail.com>.
Welcome to the community Richard!

I suspect Hadoop can be more useful than just splitting data and stitching
it back together. Depending on your use cases, it may come in handy to
manage your machines, restart failed tasks, schedule work when data becomes
available, etc. I wouldn't necessarily count it out. I'm sorry, I am not
familiar with Celery, so I can't provide a direct comparison. Also, in the
not-unlikely event that your input data grows, you wouldn't have to rewrite
your infrastructure code if you wrote your Hadoop code properly.

HTH
Ravi

On Mon, Jul 18, 2016 at 9:23 AM, Marcin Tustin <mt...@handybook.com>
wrote:

> I think you're confused as to what these things are.
>
> The fundamental question is whether you want to run one job on sub-parts
> of the data and then stitch their results together (in which case
> Hive/MapReduce/Spark will be for you), or whether you essentially already
> have the splitting into computer-sized chunks figured out and just need a
> work queue. In the latter case there are a number of alternatives. I happen
> to like Python, and would recommend Celery (potentially wrapped by
> something like Airflow) for that case.
>
> On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <
> richard.whitehead@ieee.org> wrote:
>
>> Hello,
>>
>> I wonder if the community can help me get started.
>>
>> I’m trying to design the architecture of a project and I think that using
>> some Apache Hadoop technologies may make sense, but I am completely new to
>> distributed systems and to Apache (I am a very experienced developer, but
>> my expertise is image processing on Windows!).
>>
>> The task is very simple: call 3 or 4 executables in sequence to process
>> some data.  The data is just a simple image and the processing takes tens
>> of minutes.
>>
>> We are considering a distributed architecture to increase throughput
>> (latency does not matter).  So we need a way to queue work on remote
>> computers, and a way to move the data around.  The architecture will have
>> to work on a single server, or on a couple of servers in a rack, or in the
>> cloud; 2 or 3 computers maximum.
>>
>> Being new to all this I would prefer something simple rather than
>> something super-powerful.
>>
>> I was considering Hadoop YARN and HDFS; does this make sense?  I’m
>> assuming MapReduce would be over the top; is that the case?
>>
>> Thanks in advance.
>>
>> Richard
>>
>
>
>
>

Re: Building a distributed system

Posted by Marcin Tustin <mt...@handybook.com>.
I think you're confused as to what these things are.

The fundamental question is whether you want to run one job on sub-parts
of the data and then stitch their results together (in which case
Hive/MapReduce/Spark will be for you), or whether you essentially already
have the splitting into computer-sized chunks figured out and just need a
work queue. In the latter case there are a number of alternatives. I happen
to like Python, and would recommend Celery (potentially wrapped by
something like Airflow) for that case.
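
To give a flavour of that route for this exact case, here is a minimal
sketch (broker URL, paths, and executable names are made up), with each
pipeline stage wrapping one of the existing executables and the stages
chained per image:

# tasks.py -- hypothetical Celery work queue wrapping the existing executables.
# Start a worker on each machine with:  celery -A tasks worker
import subprocess
from celery import Celery, chain

app = Celery("tasks", broker="redis://queue-host:6379/0")

@app.task
def run_stage(image_path, stage_exe):
    # Each task shells out to one executable, which is assumed to read
    # image_path and write its output in place (or alongside it).
    subprocess.check_call([stage_exe, image_path])
    return image_path

def submit(image_path):
    # Chain the stages so they run in sequence for one image, while
    # different images run in parallel across the workers.
    return chain(
        run_stage.s(image_path, "/opt/imaging/stage1"),
        run_stage.s("/opt/imaging/stage2"),
        run_stage.s("/opt/imaging/stage3"),
    ).apply_async()

Moving the images between machines would still need something alongside
this (a shared filesystem, HDFS, or an object store); Celery only carries
the task arguments.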

On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <
richard.whitehead@ieee.org> wrote:

> Hello,
>
> I wonder if the community can help me get started.
>
> I’m trying to design the architecture of a project and I think that using
> some Apache Hadoop technologies may make sense, but I am completely new to
> distributed systems and to Apache (I am a very experienced developer, but
> my expertise is image processing on Windows!).
>
> The task is very simple: call 3 or 4 executables in sequence to process
> some data.  The data is just a simple image and the processing takes tens
> of minutes.
>
> We are considering a distributed architecture to increase throughput
> (latency does not matter).  So we need a way to queue work on remote
> computers, and a way to move the data around.  The architecture will have
> to work on a single server, or on a couple of servers in a rack, or in the
> cloud; 2 or 3 computers maximum.
>
> Being new to all this I would prefer something simple rather than
> something super-powerful.
>
> I was considering Hadoop YARN and HDFS; does this make sense?  I’m
> assuming MapReduce would be over the top; is that the case?
>
> Thanks in advance.
>
> Richard
>
