Posted to mapreduce-user@hadoop.apache.org by Zachary Kozick <za...@omniar.com> on 2011/02/01 22:21:04 UTC

Hadoop / HDFS equivalent but for real-time request handling / small files?

Hi all,

I'm interested in creating a solution that leverages multiple computing
nodes in an EC2 or Rackspace cloud environment in order to
do massively parallelized processing in the context of serving HTTP
requests, meaning I want results to be aggregated within 1-4 seconds.

From what I gather, Hadoop is designed for job-oriented batch tasks, and the
minimum job completion time is around 30 seconds.  Also, HDFS is meant for
storing a small number of large files, as opposed to many small files.

My question is: is there a framework similar to Hadoop that is designed more
for on-demand parallel computing?  And is there a technology similar to HDFS
that is better at moving around small files and making them available to slave
nodes on demand?
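
For concreteness, here is a rough sketch of the fan-out/aggregate pattern I
have in mind, using nothing but the JDK; the worker URLs and the 4-second
budget are placeholders, not a real setup:

import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ScatterGather {

    // Fetch one worker's partial result over HTTP.
    static Callable<String> fetch(final String url) {
        return new Callable<String>() {
            public String call() throws Exception {
                InputStream in = new URL(url).openStream();
                try {
                    Scanner s = new Scanner(in).useDelimiter("\\A");
                    return s.hasNext() ? s.next() : "";
                } finally {
                    in.close();
                }
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical worker endpoints; in practice these would be the
        // internal addresses of the EC2/Rackspace instances.
        List<String> workers = Arrays.asList(
                "http://10.0.0.1:8080/partial?q=foo",
                "http://10.0.0.2:8080/partial?q=foo");

        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (String w : workers) {
            tasks.add(fetch(w));
        }

        ExecutorService pool = Executors.newFixedThreadPool(workers.size());
        // Fan out to all workers and wait at most 4 seconds overall.
        List<Future<String>> partials = pool.invokeAll(tasks, 4, TimeUnit.SECONDS);

        StringBuilder aggregated = new StringBuilder();
        for (Future<String> f : partials) {
            // Tasks that missed the deadline come back cancelled and are dropped.
            if (!f.isCancelled()) {
                aggregated.append(f.get()).append('\n');
            }
        }
        pool.shutdown();
        System.out.println(aggregated);
    }
}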

Re: Hadoop / HDFS equivalent but for real-time request handling / small files?

Posted by Michael Dalton <mw...@gmail.com>.
Hi Zachary,

HBase is rolling out Coprocessors in the 0.92 release, and that could be
used for more real-time computations with smaller files (e.g., HBase rows
are typically a few KB, up to 10MB in practice). Coprocessors allow you to
associate code with table regions in HBase, so you can scan region data on
startup and receive a stream of all get/put requests to the region to
maintain per-region analytics. Here's a blog post:
http://hbaseblog.com/2010/11/30/hbase-coprocessors/ and associated JIRA:
https://issues.apache.org/jira/browse/HBASE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
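
To give a flavour of what a coprocessor looks like, here is a minimal sketch
against the 0.92-era RegionObserver API (the class name and the counter it
maintains are invented, and the hook signatures vary between HBase versions):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Runs inside the region server, one instance per region. Keeps a simple
// per-region write counter as a stand-in for whatever per-region analytics
// you actually want to maintain.
public class PutCounterObserver extends BaseRegionObserver {

    private final AtomicLong puts = new AtomicLong();

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, boolean writeToWAL)
            throws IOException {
        // Called after every Put that hits this region.
        puts.incrementAndGet();
    }
}

Coprocessors can be loaded cluster-wide through configuration or attached to
individual tables via the table descriptor; see the blog post and JIRA above
for details.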

You can also check out Yahoo!'s S4 project, but that's more about performing
analytics on continuous, unbounded streams of data than processing a large
number of small files: http://s4.io/

Best,

Mike

Re: Hadoop / HDFS equivalent but for real-time request handling / small files?

Posted by Russ Ferriday <ru...@gmail.com>.
Hi Zachary,

Have you heard of Cassandra?
You may be able to write processing nodes that access data stored in Cassandra.
Probably the easiest configuration is to run your processing functions and a
Cassandra node on each machine.  Then, as you expand your computing cluster,
you also expand your Cassandra bandwidth.
This is not optimal, but it is very practical for a small project/small team.
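
As a rough illustration (this goes straight against the Thrift interface of a
0.7-era Cassandra, so the exact client calls may differ with your version, and
the keyspace, column family and row key here are invented), each processing
function would simply talk to the Cassandra daemon on its own box:

import java.nio.ByteBuffer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class LocalCassandraRead {
    public static void main(String[] args) throws Exception {
        // Talk to the Cassandra node running on this same machine,
        // so reads stay local as the cluster grows.
        TTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        client.set_keyspace("MyKeyspace");   // invented keyspace name

        // Read one column of one row; the processing function would then
        // run its computation over the returned bytes.
        ColumnPath path = new ColumnPath("MyColumnFamily");
        path.setColumn(ByteBuffer.wrap("payload".getBytes("UTF-8")));
        ColumnOrSuperColumn cosc = client.get(
                ByteBuffer.wrap("row-key-1".getBytes("UTF-8")),
                path, ConsistencyLevel.ONE);

        byte[] value = cosc.getColumn().getValue();
        System.out.println("read " + value.length + " bytes");

        transport.close();
    }
}

ConsistencyLevel.ONE keeps reads cheap but can return slightly stale data;
pick a stronger level if that matters for your results.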
--r

