Posted to common-user@hadoop.apache.org by Vadim Zaliva <kr...@gmail.com> on 2008/06/23 23:31:03 UTC

realtime hadoop

Hi!

I am considering using Hadoop for (almost) realtime data processing. I
have data coming in every second, and I would like to use a Hadoop cluster
to process it as fast as possible. I need to be able to maintain some
guaranteed maximum processing time, for example under 3 minutes.

Does anybody have experience using Hadoop in this manner? I would
appreciate it if you could share your experience or give me pointers
to some articles or pages on the subject.

Vadim

Re: realtime hadoop

Posted by "Ian Holsman (Lists)" <li...@holsman.net>.

Fernando Padilla wrote:
> One use case I have a question about, is using Hadoop to power a web
> search or other query. So the full job should be done in under a second,
> from start to finish.

I don't think you should be using Hadoop to answer a user's search query.
You should be looking at things like SOLR (with the distributed patch)
or CloudDB/MySQL clusters.

Some good research has also been done on this;
see CRUSH by Sage Weil: www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

or the work on Chord# for Wikipedia called 'onscale':
http://onscale.de/onscaledb.html

Both would be better suited to OLTP-type operations, I would think.


Re: realtime hadoop

Posted by Fernando Padilla <fe...@alum.mit.edu>.

One use case I have a question about is using Hadoop to power a web
search or other query, where the full job should be done in under a
second, from start to finish.

Say you have a huge datastore and have to run a query against it,
implemented as an MR job. Is there a way to optimize that use case,
where the code doesn't change but the input parameters of the job do?
Then an MR job could reuse the Java code, and even the same JVM, to
avoid all of the startup costs.

<digression>
I bet Hadoop isn't built for that yet (and there are enough reasons not to
support it yet), but maybe it's a use case that shouldn't be totally ignored.

And if you think about it, this is similar to what HBase is doing, at
least the query execution part: a dedicated MR daemon running on top of
the Hadoop infrastructure, so you don't incur the cost of distributing
and starting fresh MR/JVM processes across the cluster. Maybe someone
would want to refactor this thought process a little bit.
</digression>


Re: realtime hadoop

Posted by Chris K Wensel <ch...@wensel.net>.

On Jun 23, 2008, at 9:54 PM, Matt Kent wrote:

> Unless you have a significant amount of work to be done, I wouldn't
> recommend using Hadoop because it's not worth the overhead of launching
> the jobs and moving the data around.

I think part of the tradeoff is having a system that is resilient to  
failure against work that must get done, regardless of the amount of  
work.

ckw

--
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/

Re: realtime hadoop

Posted by Matt Kent <ma...@persai.com>.

We wrote some custom tools that poll for new data and launch jobs
periodically.

Matt
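
A poll-and-launch tool of the kind described above could look roughly like
the sketch below. This is not Matt's actual tooling, which the thread does
not show. The names find_new_files and poll_and_launch and the launch_job
callback are hypothetical; the callback stands in for whatever really
submits the Hadoop job.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Set


def find_new_files(inbox: Path, seen: Set[Path]) -> List[Path]:
    """Return files in `inbox` that have not been handed to a job yet."""
    return sorted(p for p in inbox.iterdir() if p.is_file() and p not in seen)


def poll_and_launch(
    inbox: Path,
    launch_job: Callable[[List[Path]], None],  # hypothetical job-submission hook
    interval_seconds: float = 60.0,
    max_rounds: Optional[int] = None,          # None means poll forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `inbox` periodically and launch one job per batch of new files.

    Returns the number of jobs launched.
    """
    seen: Set[Path] = set()
    launched = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        batch = find_new_files(inbox, seen)
        if batch:
            launch_job(batch)      # one job for everything that arrived
            seen.update(batch)     # never hand the same file to two jobs
            launched += 1
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return launched
```

The sketch assumes files appear in the inbox only once they are complete;
a real tool would also need to skip files that are still being written.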


Re: realtime hadoop

Posted by Vadim Zaliva <kr...@gmail.com>.

Matt,

How do you manage your tasks? Do you launch them periodically, or keep
them running somehow and feed them data?

Vadim



Re: realtime hadoop

Posted by "Ian Holsman (Lists)" <li...@holsman.net>.

Matt Kent wrote:
> We use Hadoop in a similar manner, to process batches of data in
> real-time every few minutes. However, we do substantial amounts of
> processing on that data, so we use Hadoop to distribute our computation.
> Unless you have a significant amount of work to be done, I wouldn't
> recommend using Hadoop because it's not worth the overhead of launching
> the jobs and moving the data around.

Thanks Matt.

We are boiling the ocean with the data, so to speak, so that's cool.
We are also looking at supplementing the m/r jobs with data coming in
from spread to get the 'instant' analysis parts of our feedback systems.

Regards
Ian

Re: realtime hadoop

Posted by Matt Kent <ma...@persai.com>.

We use Hadoop in a similar manner, to process batches of data in
real-time every few minutes. However, we do substantial amounts of
processing on that data, so we use Hadoop to distribute our computation.
Unless you have a significant amount of work to be done, I wouldn't
recommend using Hadoop because it's not worth the overhead of launching
the jobs and moving the data around.

Matt
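
One way to weigh the launch overhead against a latency target like the
"under 3 minutes" in the original question is a simple budget. The model
below is an assumption, not something measured in this thread: a record
that arrives just after a batch is cut waits one full batch interval, then
pays the job launch overhead and the processing time. The example numbers
for overhead and processing time are made up for illustration.

```python
from typing import Optional


def worst_case_latency(batch_interval_s: float,
                       launch_overhead_s: float,
                       processing_s: float) -> float:
    """Latency of a record that just missed the previous batch."""
    return batch_interval_s + launch_overhead_s + processing_s


def max_batch_interval(deadline_s: float,
                       launch_overhead_s: float,
                       processing_s: float) -> Optional[float]:
    """Largest batch interval that still meets the deadline.

    Returns None when overhead plus processing already use up the deadline.
    """
    slack = deadline_s - launch_overhead_s - processing_s
    return slack if slack > 0 else None


# Assumed example: 180 s deadline, 30 s launch overhead, 60 s processing.
interval = max_batch_interval(180, 30, 60)
print(interval)
```

The model treats processing time as independent of batch size and assumes
jobs never queue behind each other; both simplifications make the budget
optimistic.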


Re: realtime hadoop

Posted by "Ian Holsman (Lists)" <li...@holsman.net>.

Interesting.
We are planning on using Hadoop to provide 'near' real-time log
analysis. We plan on having files close every 5 minutes (1 per log
machine, so 80 files every 5 minutes) and then having an m/r job merge
them into a single file that will get processed by other jobs later on.

Do you think the namespace will explode with this setup?

I wasn't thinking of clouddb. It might be an interesting alternative
once it is a bit more stable.

regards
Ian
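
The namespace question can be estimated with back-of-the-envelope
arithmetic. The file counts below come from the message above; the figure
of roughly 150 bytes of NameNode memory per namespace object is only a
commonly quoted rule of thumb and is an assumption here, as is one block
per file.

```python
FILES_PER_INTERVAL = 80            # from the message: one file per log machine
INTERVAL_MINUTES = 5               # from the message
BYTES_PER_NAMESPACE_OBJECT = 150   # assumption: rough rule of thumb, not measured


def files_per_day(files_per_interval: int, interval_minutes: int) -> int:
    """Number of files created per day at a fixed rate."""
    intervals_per_day = (24 * 60) // interval_minutes
    return files_per_interval * intervals_per_day


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate memory used: one file object plus one object per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_NAMESPACE_OBJECT


unmerged = files_per_day(FILES_PER_INTERVAL, INTERVAL_MINUTES)
merged = files_per_day(1, INTERVAL_MINUTES)
print(unmerged, namenode_bytes(unmerged))
print(merged, namenode_bytes(merged))
```

Under these assumptions, leaving the files unmerged adds 23,040 files
(about 6.9 MB of namespace memory) per day, or about 2.5 GB per year;
merging each 5-minute batch into a single file adds 288 files per day, or
about 31.5 MB per year.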


Re: realtime hadoop

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.


Daniel wrote:
>>> Also HDFS might be critical since to access your data you need to close the file
>>
>> Not anymore. Since 0.16 files are readable while being written to.
> 
> Does this mean i can open some file as map input and the reduce output ? So
> i can update the files instead of creating new ones.

No, files are still write-once in HDFS: you cannot modify a file after it is closed.
But while it is not closed you can still write more data into it, and other clients will
be able to read this new data.
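
The semantics described above can be pictured with a toy in-memory model.
This is not the HDFS API; the class WriteOnceFile and its methods are
hypothetical and exist only to show the rule: appends are allowed until
close, reads see whatever has been written so far, and nothing can be
changed after close.

```python
class WriteOnceFile:
    """Toy model of a write-once file that is readable while being written."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._closed = False

    def append(self, chunk: bytes) -> None:
        """Add data to the end of the file; only allowed before close()."""
        if self._closed:
            raise OSError("file is closed; write-once files cannot be modified")
        self._data.extend(chunk)

    def close(self) -> None:
        """Finish writing; the contents are immutable from now on."""
        self._closed = True

    def read(self) -> bytes:
        """Readers may call this at any time, before or after close()."""
        return bytes(self._data)
```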

> Also if i want to do query in the records, should i rather use Hbase instead
> of HDFS? - say if we have large size of data stored as (key, value).

HDFS has a file system API; there is no notion of a record in it, just files and bytes.
Depending on how you define a record, you can use different systems, including HBase and Pig.
These two work well for table-like data collections.
Or you can write your own MapReduce job to process a big key-value dataset.
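
To make the shape of such a job concrete, here is a single-process sketch
of the map/reduce pattern over (key, value) records. It is plain Python,
not the Hadoop API, and the names map_reduce, count_path and total, as well
as the sample log records, are made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run mapper over every record, group by output key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # the "shuffle" step
    return {k: reducer(k, vs) for k, vs in groups.items()}


def count_path(host: str, request_line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (path, 1) for each request line."""
    _method, path = request_line.split(" ", 1)
    yield path, 1


def total(path: str, counts: List[int]) -> int:
    """Reduce: sum the counts for one path."""
    return sum(counts)


records = [("host1", "GET /a"), ("host2", "GET /b"), ("host1", "GET /a")]
print(map_reduce(records, count_path, total))
```

A real job would distribute the map and reduce calls across the cluster
and read its records from files; only the data flow is the same.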

Regards,
--Konstantin


Re: realtime hadoop

Posted by Daniel <d4...@gmail.com>.

2008/6/24 Konstantin Shvachko <sh...@yahoo-inc.com>:

> > Also HDFS might be critical since to access your data you need to close the file
>
> Not anymore. Since 0.16 files are readable while being written to.

Does this mean I can open a file as both the map input and the reduce output, so
that I can update files instead of creating new ones?
Also, if I want to query the records, should I use HBase instead
of HDFS? Say we have a large amount of data stored as (key, value) pairs.

Thanks.


Re: realtime hadoop

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.

 > Also HDFS might be critical since to access your data you need to close the file

Not anymore. Since 0.16 files are readable while being written to.

 >> it as fast as possible. I need to be able to maintain some guaranteed
 >> max. processing time, for example under 3 minutes.

It looks like you do not need very strict guarantees.
I think you can use HDFS for data storage.
I don't know what kind of data processing you do, but I agree with Stefan
that MapReduce is designed for batch tasks rather than for real-time processing.




Re: realtime hadoop

Posted by Chris Anderson <jc...@grabb.it>.

Vadim,

Depending on the nature of your data, CouchDB (http://couchdb.org)
might be worth looking into. It speaks JSON natively, and has
real-time map/reduce support. The 0.8.0 release is imminent (don't
bother with 0.7.2), and the community is active. We're using it for
something similar to what you describe, and it's working well.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: realtime hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.

Hadoop might be the wrong technology for you.
MapReduce is a batch processing mechanism. HDFS might also be
problematic: to access your data you need to close the file, which means
you might end up with many small files, a situation where HDFS is not very
strong (the namespace is held in memory).
HBase might be an interesting tool for you, and also ZooKeeper if you want
to do something home-grown...


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com