Posted to common-user@hadoop.apache.org by Adam Retter <Ad...@landmark.co.uk> on 2009/04/28 12:05:54 UTC

Appropriate for Hadoop?

If I understand correctly, Hadoop forms a general-purpose cluster on
which you can execute jobs?

We have a Java data processing application here that follows the
Producer -> Consumer pattern. It was written with threading in mind from
the start, using java.util.concurrent.Callable.

At present the producer is a thread that retrieves a list of document
URIs from a SQL query against databaseA and adds them to a shared
(synchronised) queue.

Each consumer is a thread, of which there can be n, but we typically run
with 16 on the current hardware.
The consumer sits in a loop, processing the queue until it is empty. It
removes a document URI from the shared queue, retrieves the document and
performs a pipeline of transformations on the document, resulting in a
series of 600 to 16,000 SQL INSERT statements, which are then executed
against databaseB.
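
In outline, the current code has roughly the following shape (class and
method names below are simplified placeholders rather than our actual
code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DocumentPipeline {

    private static final int CONSUMERS = 16;

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        ExecutorService pool = Executors.newFixedThreadPool(CONSUMERS);

        // Producer: a SQL query against databaseA yields the document URIs.
        for (String uri : loadUrisFromDatabaseA()) {
            queue.put(uri);
        }

        // Consumers: each Callable drains the queue, transforms the document
        // and executes the resulting INSERT statements against databaseB.
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        for (int i = 0; i < CONSUMERS; i++) {
            results.add(pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    int processed = 0;
                    String uri;
                    while ((uri = queue.poll(1, TimeUnit.SECONDS)) != null) {
                        List<String> inserts = transformDocument(uri);
                        executeAgainstDatabaseB(inserts);
                        processed++;
                    }
                    return processed;
                }
            }));
        }

        for (Future<Integer> result : results) {
            result.get(); // propagate any consumer failure
        }
        pool.shutdown();
    }

    // Placeholder stubs for the steps described above.
    private static List<String> loadUrisFromDatabaseA() {
        return Collections.emptyList();
    }

    private static List<String> transformDocument(String uri) {
        return Collections.emptyList(); // 600 to 16,000 INSERT statements per document
    }

    private static void executeAgainstDatabaseB(List<String> inserts) {
        // JDBC batch execution against databaseB
    }
}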

I have been reading about both Terracotta and Hadoop. Hadoop appears to
be the more general-purpose solution that we could use for many
applications; however, I am not sure how our application would map onto
Hadoop concepts. I have been studying Hadoop's Map/Reduce approach, but
our application does not produce any intermediate files that would serve
as the input/output of the Map and Reduce processes.

Any guidance would be appreciated; it may well be that our application
is not an appropriate use of Hadoop.


Thanks, Adam.
 
Adam Retter
Software Developer
Landmark Information Group
 
T: 01392 685403 (x5403) 
 
5-7 Abbey Court, Eagle Way, Sowton,
Exeter, Devon, EX2 7HY
 
www.landmark.co.uk
 




Re: Appropriate for Hadoop?

Posted by Wang Zhong <wa...@gmail.com>.
Hi Adam,

It seems that your producers and consumers work in parallel, so you
could use Hadoop to process your application. The main problem is the
cost of communication with the database. You can refer to Ankur's thread
with the subject 'Hadoop / MySQL'.
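
One possible mapping, just as a rough sketch against the new
org.apache.hadoop.mapreduce API (class names are my own, and the
database work is only indicated by a comment): write the document URIs
to a text file on HDFS, one per line, and run a map-only job so that
each map() call performs the fetch/transform/insert pipeline for one
URI. With setNumReduceTasks(0) there is no intermediate shuffle output
at all.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DocumentTransformJob {

    public static class TransformMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text uri, Context context)
                throws IOException, InterruptedException {
            // Placeholder: fetch the document for this URI, run the
            // transformation pipeline, and execute the resulting INSERT
            // statements against databaseB (batched, to limit the cost
            // of talking to the database from many tasks at once).
            context.getCounter("pipeline", "documents").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "document-transform");
        job.setJarByClass(DocumentTransformJob.class);
        job.setMapperClass(TransformMapper.class);
        job.setNumReduceTasks(0);                              // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // file of URIs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If the INSERT volume turns out to be the bottleneck, batching the
statements per map task (or writing files that are bulk-loaded
afterwards) would reduce the database round trips, which is the
expensive part as mentioned above.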


Regards,





-- 
Wang Zhong