Posted to common-user@hadoop.apache.org by Thiago Moraes <th...@cmoraes.com> on 2011/08/15 02:38:17 UTC

Is Hadoop suitable for my data distribution problem?

Hey guys,

I'm new to the list, and I'm currently considering Hadoop to solve a data
distribution problem. Right now, there's a single server which holds very
large files (typical files are 30 GB or more). This server is accessed over
the LAN and over the internet but, of course, access is slow and painful
without a local connection.

My idea to solve this is to deploy new servers at the places that access the
data most often, in such a way that each one gets a local copy of the files
it accesses most. These new servers would download and store parts of the
data (entire files, not fragments) so that the files can be served over their
own LAN alone, without having to rely on another server's data. Is it
possible to enforce this kind of restriction when splitting a file across
Hadoop's nodes?
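
To make the idea concrete, here is roughly what I have in mind (the
hostnames, port, and path below are made up). Each site would run its own
small HDFS cluster, and the most-accessed files would be mirrored from the
central cluster with distcp:

    # mirror a large file from the central cluster to a site-local cluster
    hadoop distcp hdfs://central-nn:8020/data/huge-file.bin \
                  hdfs://site1-nn:8020/data/huge-file.bin

But that means running several independent clusters instead of one, which
may defeat the purpose, so I'd rather know whether a single cluster can pin
whole files to chosen nodes.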

In reality, I don't even know whether this restriction is useful. In my
head, enforcing this kind of data locality would make it possible to use the
data locally even when there is no internet connection, at the price of
limiting how many nodes are available for load balancing and replication. Is
this tradeoff acceptable, or at least possible, with Hadoop?
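
For what it's worth, I know the replication factor can be tuned per file,
for example:

    # keep two replicas of this file and wait until replication finishes
    hadoop fs -setrep -w 2 /data/huge-file.bin

but as far as I can tell that only controls how many copies exist, not which
nodes they land on, and the placement part is what I'm unsure about.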

thanks,

Thiago Moraes - EnC 07 - UFSCar