Posted to dev@hbase.apache.org by Aditya Kumar <ad...@yahoo.com> on 2011/09/04 05:06:19 UTC

Hadoop real time

Hi,

I am reading an article on Hadoop and HBase.
Can anyone explain the following statements to me?

-> Hadoop is not used for real time processing?
I think we should be able to use it for Real time processing also...

Can you please tell me why Hadoop is said not to be used for Real time processing of data?

-> Hadoop augments existing databases?
I was thinking we can replace the DB with Hadoop...I do not see any issue?

Re: Hadoop real time

Posted by Ted Dunning <td...@maprtech.com>.

There are additional off-shoots of Hadoop that can specifically address
real-time needs, such as Spark, S4 and HStreaming.

Most real-time-ish applications, however, come with a 100% uptime
requirement.  Put simply, a system that is down and is going to take tens to
hundreds of minutes to come back is going to miss a lot of real-time windows.

As such, you may need to investigate derivatives of Hadoop that explicitly
support high availability.

On Sat, Sep 3, 2011 at 11:38 PM, Jacques <wh...@gmail.com> wrote:

> It is hard to reply to an article that you don't actually reference, but
> I'll do my best.  Also, you don't define real-time, so I'll consider it as
> being something that would come back within 1-2 seconds (e.g. an end user
> on a web site is waiting for the info).
>
> >> Can you please tell me why Hadoop is said not to be used for Real time
> >> processing of data?
>
> There are two different parts to the core Hadoop project.  Both of these
> are focused more on batch processing by themselves as opposed to real time
> workflows.
> 1. HDFS, a distributed file system that is good at safely managing a large
> quantity of very large files.  Generally speaking, HDFS is a write-once
> file system.  You can't modify the middle of a file after it is written.
> You also can't append to the end of a file without a special version of
> Hadoop.  Also, you can't tail a file directly as it is being written.  As
> such, it would be hard to use it directly to create a real-time workflow.
>
> 2. MapReduce is a distributed computing framework.  It is used to process
> those large files held on HDFS.  Because of the design of MapReduce, jobs
> usually take at least 10 seconds and typically much longer. This would also
> mean you're looking at batch processing large quantities of data in some
> non-real-time period.
>
> HBase is a separate sub-project from the Hadoop project proper.  It is
> built specifically to handle real time loads.  You can insert a row and get
> it back immediately.
>
> > I was thinking we can replace the DB with Hadoop...I do not see any
> > issue?
>
> HBase can replace many of the functions of existing databases but should be
> used primarily when you need the massive scale it can provide.  With HBase
> you have to give up things like transactions and SQL when compared to
> traditional RDBMSs (MySQL, PostgreSQL, etc.).  The schema design is very
> different and generally your application must be built with this in mind.
> You should probably spend some time with the HBase book
> (http://hbase.apache.org/book.html) and look at your current applications
> to determine what kinds of things you would need to do.  Many people
> actually use HBase in parallel with a traditional RDBMS, leveraging the
> strengths of each.
>
> Good luck!
>

Re: Hadoop real time

Posted by Jacques <wh...@gmail.com>.

It is hard to reply to an article that you don't actually reference, but I'll
do my best.  Also, you don't define real-time, so I'll consider it as being
something that would come back within 1-2 seconds (e.g. an end user on a web
site is waiting for the info).

>> Can you please tell me why Hadoop is said not to be used for Real time
>> processing of data?

There are two different parts to the core Hadoop project.  Both of these
are focused more on batch processing by themselves as opposed to real time
workflows.
1. HDFS, a distributed file system that is good at safely managing a large
quantity of very large files.  Generally speaking, HDFS is a write-once
file system.  You can't modify the middle of a file after it is written.
You also can't append to the end of a file without a special version of
Hadoop.  Also, you can't tail a file directly as it is being written.  As
such, it would be hard to use it directly to create a real-time workflow.

2. MapReduce is a distributed computing framework.  It is used to process
those large files held on HDFS.  Because of the design of MapReduce, jobs
usually take at least 10 seconds and typically much longer. This would also
mean you're looking at batch processing large quantities of data in some
non-real-time period.
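
To make the batch nature of MapReduce concrete, here is a minimal
single-process sketch of the map, shuffle and reduce steps in plain Python.
It is not the Hadoop API; the function names (map_phase, shuffle_phase,
reduce_phase, run_job) are made up for illustration.  The point it tries to
show is that the grouping step needs the complete map output before any
result exists, which is why the model suits whole-dataset jobs rather than
answering one request in a second or two.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key. In this sketch it needs the complete map output."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Run the reducer once per key over all of that key's values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


def run_job(records):
    """Run the three phases in order; nothing is returned until all finish."""
    pairs = map_phase(records, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, sum_reducer)


if __name__ == "__main__":
    lines = ["hadoop is batch", "hbase is real time", "hadoop is big"]
    print(run_job(lines))
```

A real Hadoop job adds job scheduling, task startup and reading and writing
HDFS on top of this, which is where the multi-second floor mentioned above
comes from.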

HBase is a separate sub-project from the Hadoop project proper.  It is
built specifically to handle real time loads.  You can insert a row and get
it back immediately.
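
As a rough picture of what "insert a row and get it back" looks like, here
is a toy in-memory table in Python.  It is only a sketch of the access
pattern (row key plus column, with the latest version returned by default),
not the HBase client API; the class ToyTable and its put/get methods are
hypothetical names.

```python
import itertools


class ToyTable:
    """In-memory stand-in for a key-value table; not the HBase client API."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), oldest first
        self._rows = {}
        # Logical clock used as a stand-in for cell timestamps.
        self._clock = itertools.count(1)

    def put(self, row_key, column, value):
        """Store a new version of the cell; older versions are kept."""
        versions = self._rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((next(self._clock), value))

    def get(self, row_key, column):
        """Return the most recent value for the cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[-1][1]


if __name__ == "__main__":
    table = ToyTable()
    table.put("user1", "info:email", "old@example.com")
    table.put("user1", "info:email", "new@example.com")
    # The write is visible to the very next read; no batch job is involved.
    print(table.get("user1", "info:email"))
```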

> I was thinking we can replace the DB with Hadoop...I do not see any
> issue?

HBase can replace many of the functions of existing databases but should be
used primarily when you need the massive scale it can provide.  With HBase
you have to give up things like transactions and SQL when compared to
traditional RDBMSs (MySQL, PostgreSQL, etc.).  The schema design is very
different and generally your application must be built with this in mind.
You should probably spend some time with the HBase book
(http://hbase.apache.org/book.html) and look at your current applications
to determine what kinds of things you would need to do.  Many people
actually use HBase in parallel with a traditional RDBMS, leveraging the
strengths of each.
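
To illustrate how different the schema design is, here is a small Python
sketch of one common approach: since rows are kept sorted by row key and
read back by key or key range, the query you want gets baked into the key
itself instead of being expressed in SQL.  Everything here is an assumption
for illustration (the "user#reversed-timestamp" key layout, the SortedRows
class, the helper names); it models the idea and is not HBase code.

```python
import bisect

# Largest signed 64-bit value; subtracting from it makes newer rows sort first.
MAX_TS = 2**63 - 1


def make_row_key(user_id, timestamp_ms):
    """Composite key: user id, then a zero-padded reversed timestamp."""
    return f"{user_id}#{MAX_TS - timestamp_ms:019d}"


class SortedRows:
    """Toy table that keeps row keys in sorted order, like a key-range store."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # row key -> value

    def put(self, row_key, value):
        """Insert or overwrite a row, keeping the key list sorted."""
        if row_key not in self._values:
            bisect.insort(self._keys, row_key)
        self._values[row_key] = value

    def scan_prefix(self, prefix, limit=None):
        """Return values for keys starting with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            results.append(self._values[key])
            if limit is not None and len(results) >= limit:
                break
        return results


if __name__ == "__main__":
    rows = SortedRows()
    rows.put(make_row_key("alice", 1000), "old event")
    rows.put(make_row_key("alice", 2000), "new event")
    rows.put(make_row_key("bob", 1500), "other user")
    # Roughly "latest event for alice", answered by key order alone.
    print(rows.scan_prefix("alice#", limit=1))
```

In an RDBMS the same question would be an ORDER BY with a LIMIT on an
indexed column; here the ordering has to be decided when the key is designed,
which is why the application generally has to be built around it.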

Good luck!