Posted to common-dev@hadoop.apache.org by Hemant kulkarni <ku...@gmail.com> on 2011/10/07 00:35:42 UTC

Hadoop for unstructured data storage

Hi all,
We are a small software development firm working on data backup
software. We have a backup product that copies data from client
machines to a data store. Currently we provide specialized hardware to
store the data (1-3 TB disks and servers). We want to offer a solution to
some customers (a mining company) with the following requirements:
1] Huge data storage capacity (initially 100 TB, but it should be
easy to grow)
2] Initially the facility is used only for storage, but in future the
company plans to add data processing software (some MapReduce jobs)
3] Most of the data is unstructured (mostly images, text files and videos)
4] Much of the data is a duplicate of some original, so we need
deduplication (a content-hash sketch follows this list)
5] Data is mostly appended (daily backups) and only occasionally
read (new data is written every day, reads happen roughly weekly)
6] Data is copied as files (every backup is about 100,000 files, each
a few MB, some only KB)
7] This is backup storage, so latency requirements are not very strict
8] Some of the data has very high HA requirements and should be copied
to data centers outside the country on a regular basis (weekly; the data
size is small, a few TB). A per-file replication sketch appears after the
questions below.
9] Currently we provide some sort of HSM (Hierarchical Storage
Management); the company needs something similar in the new solution
10] A single namespace and versioning of files is another requirement
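
On point 4, the usual technique is content-addressed storage: hash each
file's bytes and keep one stored copy per hash. Below is a minimal sketch
of that idea in Java; the class and method names are made up for
illustration, and a real index would live in a durable store rather than
an in-memory map.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustrative content-addressed dedup index: files with identical
// bytes hash to the same key, so each unique payload is stored once.
// (Class and method names are hypothetical, for this sketch only.)
public class DedupIndex {

    // digest (hex) -> location of the single stored copy
    private final Map<String, String> storedByHash = new HashMap<String, String>();

    // Stream the file through SHA-256 and return the digest as hex.
    static String sha256Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true if this payload is new and must actually be uploaded;
    // otherwise the backup catalog can just point at the existing copy.
    boolean isNewPayload(String path) throws IOException, NoSuchAlgorithmException {
        String hash = sha256Of(path);
        if (storedByHash.containsKey(hash)) {
            return false;
        }
        storedByHash.put(hash, path);
        return true;
    }
}

With SHA-256 the chance of two distinct files colliding is negligible,
so the digest can safely serve as the storage key.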

As I understand it, HDFS doesn't directly suit such storage, due to the
following design considerations:
1] A large number of small files (see the packing sketch just below)
2] Duplicate data
3] A write-many, read-once access pattern, whereas HDFS is designed
around write-once, read-many
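
For point 1, the standard mitigation is to pack many small files into a
few large containers at ingest time, because the NameNode holds every
file's metadata in RAM. SequenceFile is one stock container format for
this (Hadoop Archives is another); here is a rough sketch, where the
key/value layout and paths are assumptions for illustration, not a full
design:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack many small local files into one HDFS SequenceFile so the
// NameNode tracks one large file instead of 100,000 small ones.
public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]); // e.g. /backups/2011-10-07.seq

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {
                File f = new File(args[i]);
                byte[] contents = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                } finally {
                    in.close();
                }
                // Key = original file path, value = raw bytes.
                writer.append(new Text(f.getPath()), new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

The trade-off is that individual files are no longer addressable by their
own HDFS path, so the backup catalog must record which container holds
each file.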

Here are my questions:
1] Does HDFS support our client's requirements, or can it at least be
configured to suit them?
2] Is there any customization of HDFS (if possible) that would serve the purpose?

Is there any other solution that would work?
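
On the "can it be configured" part of question 1: requirement 8 at least
maps onto stock HDFS features. Replication is set per file, so only the
critical subset pays the extra storage cost, and cluster-to-cluster copies
are normally done with the bundled distcp tool (hadoop distcp <src> <dst>)
rather than custom code. A minimal sketch of the per-file part, with a
hypothetical path:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raise the replication factor for an already-written HDFS path.
// HDFS replication is per file, so only the high-HA subset
// pays the extra storage cost.
public class RaiseReplication {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical path for the critical backup subset.
        Path critical = new Path("/backups/critical/2011-10-07.seq");
        // Default is 3; 5 spreads the blocks across more DataNodes.
        boolean ok = fs.setReplication(critical, (short) 5);
        System.out.println("replication change scheduled: " + ok);
    }
}

Note that higher replication inside one cluster is not the same as an
off-site copy; for the cross-country requirement, the weekly distcp run to
the remote cluster is what actually provides the geographic redundancy.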

All thoughts/suggestions are welcome.

Regards,
Hemant.

Re: Hadoop for unstructured data storage

Posted by Ted Dunning <td...@maprtech.com>.
HDFS does not really meet your needs.  I think that MapR's solution would.
I will contact you off-line to give details.

On Thu, Oct 6, 2011 at 3:35 PM, Hemant kulkarni <ku...@gmail.com> wrote:

> [original message snipped]
