Posted to common-user@hadoop.apache.org by Matt Painter <ma...@deity.co.nz> on 2012/10/15 21:47:26 UTC

Suitability of HDFS for live file store

Hi,

I am a new Hadoop user, and would really appreciate your opinions on
whether Hadoop is the right tool for what I'm thinking of using it for.

I am investigating options for scaling an archive of around 100Tb of image
data. These images are typically TIFF files of around 50-100Mb each and
need to be made available online in realtime. Access to the files will be
sporadic and occasional, but writing the files will be a daily activity.
Speed of write is not particularly important.

Our previous solution was a monolithic, expensive - and very full - SAN so
I am excited by Hadoop's distributed, extensible, redundant architecture.

My concern is that much of the discussion around Hadoop, and many of its
use cases, revolve around data processing with MapReduce and - from what I
understand - using HDFS purely as input for MapReduce jobs. My other
concern is the vague indication that it's not a 'real-time' system. We may
be using MapReduce in small components of the application, but it will most
likely be for file access analysis rather than any processing on the files
themselves.

In other words, what I really want is a distributed, resilient, scalable
filesystem.

Is Hadoop suitable if we just use this facility, or would I be misusing it
and inviting grief?

M

Re: Suitability of HDFS for live file store

Posted by Brock Noland <br...@cloudera.com>.
Hi,

Harsh makes a good point: there is no explicit way to say "these files
should remain in memory". However, I would note that given available
RAM on the datanodes, the operating system will cache recently
accessed blocks.

Brock

On Mon, Oct 15, 2012 at 3:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
>> Hi,
>>
>> I am a new Hadoop user, and would really appreciate your opinions on whether
>> Hadoop is the right tool for what I'm thinking of using it for.
>>
>> I am investigating options for scaling an archive of around 100Tb of image
>> data. These images are typically TIFF files of around 50-100Mb each and need
>> to be made available online in realtime. Access to the files will be
>> sporadic and occasional, but writing the files will be a daily activity.
>> Speed of write is not particularly important.
>>
>> Our previous solution was a monolithic, expensive - and very full - SAN so I
>> am excited by Hadoop's distributed, extensible, redundant architecture.
>>
>> My concern is that a lot of the discussion on and use cases for Hadoop is
>> regarding data processing with MapReduce and - from what I understand -
>> using HDFS for the purpose of input for MapReduce jobs. My other concern is
>> vague indication that it's not a 'real-time' system. We may be using
>> MapReduce in small components of the application, but it will most likely be
>> in file access analysis rather than any processing on the files themselves.
>>
>> In other words, what I really want is a distributed, resilient, scalable
>> filesystem.
>>
>> Is Hadoop suitable if we just use this facility, or would I be misusing it
>> and inviting grief?
>>
>> M
>
>
>
> --
> Harsh J



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
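
A minimal sketch of the read path being discussed here - the class name, file path and buffer size are placeholders, nothing below comes from the thread itself - streams a file out of HDFS through the Java FileSystem API; the datanode operating system's page cache that Brock mentions sits underneath these reads:

    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsDownload {
        // Streams one HDFS file to the supplied output stream (e.g. an HTTP response).
        public static void stream(String hdfsPath, OutputStream out) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // fs.default.name must point at the namenode
            FSDataInputStream in = fs.open(new Path(hdfsPath));
            try {
                IOUtils.copyBytes(in, out, 65536, false);  // 64 KB copy buffer; caller closes 'out'
            } finally {
                in.close();
            }
        }
    }

From a servlet, stream("/archive/tiff/12345.tif", response.getOutputStream()) would hand the bytes to the browser as they come off the datanodes.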

Re: Suitability of HDFS for live file store

Posted by Matt Painter <ma...@deity.co.nz>.
Thanks guys; really appreciated.

I was deliberately vague about the notion of real-time because I didn't
know which metrics make Hadoop considered a batch system - if that makes
sense!

Essentially, the speed of access to files stored in HDFS needs to be
comparable to reading files off a native file system, since end users will
be downloading them directly. While the bulk of the data on disk will be
TIFF files, we will also be including JPEG derivatives which we intend to
display inline in a web-based application.

We typically have sparse access patterns - we have millions of files, but
each file may be viewed no more than once over a year. Therefore, native
in-memory caching isn't so much of an issue.

M

On 16 October 2012 09:08, Harsh J <ha...@cloudera.com> wrote:

> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on
> whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of
> image
> > data. These images are typically TIFF files of around 50-100Mb each and
> need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN
> so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern
> is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most
> likely be
> > in file access analysis rather than any processing on the files
> themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing
> it
> > and inviting grief?
> >
> > M
>
>
>
> --
> Harsh J
>



-- 
Matt Painter
matt@deity.co.nz
+64 21 115 9378
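
On the file access analysis that Harsh suggests driving from the NameNode audit log, a rough sketch of counting reads per file follows. It assumes the common key=value audit line format (fields such as cmd=open and src=/path); the exact layout should be checked against the Hadoop version in use, and the class and log file names are only examples.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    // Counts cmd=open operations per src path in an HDFS NameNode audit log.
    public class AuditOpenCounter {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> opens = new HashMap<String, Integer>();
            BufferedReader reader = new BufferedReader(new FileReader(args[0])); // e.g. hdfs-audit.log
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.contains("cmd=open")) continue;        // only count file reads
                for (String field : line.split("\\s+")) {
                    if (field.startsWith("src=")) {
                        String src = field.substring(4);
                        Integer n = opens.get(src);
                        opens.put(src, n == null ? 1 : n + 1);
                    }
                }
            }
            reader.close();
            for (Map.Entry<String, Integer> e : opens.entrySet()) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }

With access as sparse as described above, an offline pass like this over the audit log may be all the analysis that is needed.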

Re: Hadoop and CUDA

Posted by Michael Segel <mi...@hotmail.com>.
When you create your jar using NetBeans, do you include the Hadoop libraries in the jar you create? 
This would increase the size of the jar, and in this case, size does matter. 

On Oct 18, 2012, at 5:06 AM, sudha sadhasivam <su...@yahoo.com> wrote:

> 
> 
> Sir
> 
> We are trying to combine Hadoop and CUDA. When we create a jar file for hadoop programs from command prompt it runs faster. When we create a jar file from netbeans it runs slower. We could not understand the problem.
> 
> This is important as we are trying to work with hadoop and CUDA (jcuda).We could create a jar file only using netbeans IDE. The performance seemed to be very poor. Especially, we feel that it reads part by part for every  few KBs
> 
> The code executes, but time taken for execution is high
> Does not show any advantages in two levels of parallelism
> 
> Kindly let us know about the problem
> Thanking you
> G Sudha
> 
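
One hedged way to check this, assuming the project is (or can be) built with Maven rather than the default NetBeans Ant build, is to declare the Hadoop artifact with provided scope so the code compiles against it but the job jar does not carry it. The artifact name and version below are only examples from the Hadoop 1.x era:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.0.3</version>
      <!-- already on the cluster classpath; keep it out of the job jar -->
      <scope>provided</scope>
    </dependency>

Comparing the size of a jar built this way with the NetBeans-built one is a quick test of whether bundled libraries explain the slowdown.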


Re: Hadoop and CUDA

Posted by sudha sadhasivam <su...@yahoo.com>.
Sir

We are trying to combine Hadoop and CUDA. When we create a jar file for our
Hadoop programs from the command prompt it runs faster; when we create the
jar file from NetBeans it runs slower. We could not understand the problem.

This is important as we are trying to work with Hadoop and CUDA (jcuda). We
could only create a jar file using the NetBeans IDE, and the performance
seemed to be very poor. In particular, it appears to read part by part, a
few KB at a time.

The code executes, but the time taken for execution is high, and it does
not show any advantage from the two levels of parallelism.

Kindly let us know about the problem.
Thanking you,
G Sudha
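
If the jar packaging turns out not to be the cause, the "reads part by part for every few KBs" symptom can also come from a small HDFS client read buffer. A hedged sketch follows, using the standard io.file.buffer.size property and the FileSystem.open(Path, int) overload; the class name, path and sizes are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reads an HDFS file with an enlarged buffer instead of the 4 KB default.
    public class BufferedHdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("io.file.buffer.size", 131072);                 // 128 KB client buffer
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path(args[0]), 131072);  // explicit per-stream buffer
            byte[] buf = new byte[131072];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n;   // real code would hand each chunk to the JCuda kernel here
            }
            in.close();
            System.out.println("read " + total + " bytes");
        }
    }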


Re: Hadoop and CUDA

Posted by Michael Segel <mi...@hotmail.com>.
Please don't hijack a thread. Start your own discussion.

On Oct 16, 2012, at 1:34 AM, sudha sadhasivam <su...@yahoo.com> wrote:

> 
> The code executes, but time taken for execution is high
> Does not show any advantages in two levels of parallelism
> G Sudha
> 
> --- On Tue, 10/16/12, Manoj Babu <ma...@gmail.com> wrote:
> 
> From: Manoj Babu <ma...@gmail.com>
> Subject: Re: Hadoop and CUDA
> To: user@hadoop.apache.org
> Date: Tuesday, October 16, 2012, 11:49 AM
> 
> Hi,
> 
> If it is a runnable jar you are creating from netbeans Check only the necessary dependencies are added.
> 
> 
> Cheers!
> Manoj.
> 
> 
> 
> On Tue, Oct 16, 2012 at 11:38 AM, sudha sadhasivam <su...@yahoo.com> wrote:
> Hello
> 
> When we create a jar file for hadoop programs from command prompt it runs faster. When we create a jar file from netbeans it runs slower. We could not understand the problem.
> 
> This is important as we are trying to work with hadoop and CUDA (jcuda).We could create a jar file only using netbeans IDE. The performance seemed to be very poor. Especially, we feel that it reads part by part for every  few KBs
> 
> Kindly let us know about the problem
> Thanking you
> G Sudha
> 


Re: Hadoop and CUDA

Posted by sudha sadhasivam <su...@yahoo.com>.
The code executes, but the time taken for execution is high, and it does
not show any advantage from the two levels of parallelism.
G Sudha

--- On Tue, 10/16/12, Manoj Babu <ma...@gmail.com> wrote:

From: Manoj Babu <ma...@gmail.com>
Subject: Re: Hadoop and CUDA
To: user@hadoop.apache.org
Date: Tuesday, October 16, 2012, 11:49 AM

Hi,
If it is a runnable jar you are creating from netbeans Check only the necessary dependencies are added.

Cheers!
Manoj.



On Tue, Oct 16, 2012 at 11:38 AM, sudha sadhasivam <su...@yahoo.com> wrote:


Hello

When we create a jar file for hadoop programs from command prompt it runs faster. When we create a jar file from netbeans it runs slower. We could not understand the problem.



This is important as we are trying to work with hadoop and CUDA (jcuda).We could create a jar file only using netbeans IDE. The performance seemed to be very poor. Especially, we feel that it reads part by part for every  few KBs



Kindly let us know about the problem
Thanking you
G Sudha



Re: Hadoop and CUDA

Posted by Manoj Babu <ma...@gmail.com>.
Hi,

If it is a runnable jar you are creating from NetBeans, check that only the
necessary dependencies are added.


Cheers!
Manoj.



On Tue, Oct 16, 2012 at 11:38 AM, sudha sadhasivam <
sudhasadhasivam@yahoo.com> wrote:

> Hello
>
> When we create a jar file for hadoop programs from command prompt it runs
> faster. When we create a jar file from netbeans it runs slower. We could
> not understand the problem.
>
> This is important as we are trying to work with hadoop and CUDA (jcuda).We
> could create a jar file only using netbeans IDE. The performance seemed to
> be very poor. Especially, we feel that it reads part by part for every  few
> KBs
>
> Kindly let us know about the problem
> Thanking you
> G Sudha
>

Re: Suitability of HDFS for live file store

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
For your original use case, HDFS indeed sounded like overkill. But once you start thinking of thumbnail generation, PDFs etc., MapReduce obviously fits the bill.

If you wish to do things like streaming the stored digital films, you may want to move that serving to something else that works in tandem with Hadoop.

Thanks,
+Vinod

On Oct 15, 2012, at 1:59 PM, Matt Painter wrote:

> Sorry, I should have provided a bit more detail. Currently our data set comprises of 50-100Mb TIFF files. In the near future we'd like to store and process preservation-quality digitised film, which will individually exceed this size by orders of magnitude (and has currently been in the "too-hard" basket with our current infrastructure). In general, our thinking thus far has been very much based on what our current infrastructure can provide - so I'm excited to have alternatives available.
> 
> There will also be thumbnail generation as well as generation of the screen-resolution JPEGs that I alluded to, and PDF generation. Whether the JPEG/PDF derivatives are stored in HDFS remains to be seen - these can be easily regenerated at any stage and their total size will be relatively small, so it may not be the best fit for storage of these guys.
> 
> M
> 
> 
> On 16 October 2012 09:35, Brian Bockelman <bb...@cse.unl.edu> wrote:
> Hi,
> 
> We use HDFS to process data for the LHC - somewhat similar case here.  Our files are a bit larger, our total local data size is ~1PB logical, and we "bring our own" batch system, so no Map-Reduce.  We perform many random reads, so we are quite sensitive to underlying latency.
> 
> I don't see any obvious mismatches between your requirements and HDFS capabilities that you can eliminate it as a candidate without an evaluation.  Do note that HDFS does not provide complete POSIX semantics - but you don't appear to need them?
> 
> IMHO, if you are looking for the following requirements:
> 1) Proven petascale data store (never want to be on the bleeding edge of your filesystem's scaling!).
> 2) Has self-healing semantics (can recover from the loss of RAIDs or entire storage targets).
> 3) Open source (but do consider commercial companies - your time is worth something!).
> 
> You end up looking at a very small number of candidates.  Other filesystems that should be on your list:
> 
> 1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial support.  I personally don't know enough to provide a pros/cons list, but we keep it on our radar.
> 2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.  Requires a quite recent kernel.  Quite good on-paper design.
> 3) Lustre.  I think you'd be disappointed with the self-healing.  A very "traditional" HPC/clustered filesystem design.
> 
> For us, HDFS wins.  I think it has the possibility of being a winner in your case too.
> 
> Brian
> 
> On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:
> 
>> Seems like a heavyweight solution unless you are actually processing the images? 
>> 
>> Wow, no mapreduce, no streaming writes, and relatively small files.  I'm surprised that you are considering hadoop at all?
>> 
>> I'm surprised there isn't a simpler solution that uses redundancy without all the 
>> daemons and name nodes and task trackers and stuff.
>> 
>> Might make it kind of awkward as a normal file system. 
>> 
>> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
>> Hey Matt,
>> 
>> What do you mean by 'real-time' though? While HDFS has pretty good
>> contiguous data read speeds (and you get N x replicas to read from),
>> if you're looking to "cache" frequently accessed files into memory
>> then HDFS does not natively have support for that. Otherwise, I agree
>> with Brock, seems like you could make it work with HDFS (sans
>> MapReduce - no need to run it if you don't need it).
>> 
>> The presence of NameNode audit logging will help your file access
>> analysis requirement.
>> 
>> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
>> > Hi,
>> >
>> > I am a new Hadoop user, and would really appreciate your opinions on whether
>> > Hadoop is the right tool for what I'm thinking of using it for.
>> >
>> > I am investigating options for scaling an archive of around 100Tb of image
>> > data. These images are typically TIFF files of around 50-100Mb each and need
>> > to be made available online in realtime. Access to the files will be
>> > sporadic and occasional, but writing the files will be a daily activity.
>> > Speed of write is not particularly important.
>> >
>> > Our previous solution was a monolithic, expensive - and very full - SAN so I
>> > am excited by Hadoop's distributed, extensible, redundant architecture.
>> >
>> > My concern is that a lot of the discussion on and use cases for Hadoop is
>> > regarding data processing with MapReduce and - from what I understand -
>> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
>> > vague indication that it's not a 'real-time' system. We may be using
>> > MapReduce in small components of the application, but it will most likely be
>> > in file access analysis rather than any processing on the files themselves.
>> >
>> > In other words, what I really want is a distributed, resilient, scalable
>> > filesystem.
>> >
>> > Is Hadoop suitable if we just use this facility, or would I be misusing it
>> > and inviting grief?
>> >
>> > M
>> 
>> 
>> 
>> --
>> Harsh J
>> 
>> 
>> 
>> -- 
>> Jay Vyas
>> MMSB/UCHC
> 
> 
> 
> 
> -- 
> Matt Painter
> matt@deity.co.nz
> +64 21 115 9378
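
For the thumbnail generation Vinod refers to, a map-only job is usually enough. The outline below is only a sketch under stated assumptions: the job input is a text file listing one source image path per line, javax.imageio can decode the source format (stock Java cannot read TIFF without an ImageIO plugin), and every name and size is a placeholder:

    import java.awt.Graphics2D;
    import java.awt.Image;
    import java.awt.image.BufferedImage;
    import java.io.IOException;

    import javax.imageio.ImageIO;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only thumbnail generator: each input value is an HDFS path to a source
    // image; a 200-pixel-wide JPEG is written next to it as <name>.thumb.jpg.
    public class ThumbnailMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path src = new Path(value.toString());
            FSDataInputStream in = fs.open(src);
            BufferedImage img = ImageIO.read(in);
            in.close();
            if (img == null) {
                return;                                   // not a decodable image; skip it
            }
            int w = 200;
            int h = Math.max(1, img.getHeight() * w / img.getWidth());
            BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = thumb.createGraphics();
            g.drawImage(img.getScaledInstance(w, h, Image.SCALE_SMOOTH), 0, 0, null);
            g.dispose();
            Path dst = new Path(src.toString() + ".thumb.jpg");
            FSDataOutputStream out = fs.create(dst, true);  // overwrite any previous thumbnail
            ImageIO.write(thumb, "jpg", out);
            out.close();
            context.write(NullWritable.get(), new Text(dst.toString()));
        }
    }

Running it with zero reduce tasks (job.setNumReduceTasks(0)) over a TextInputFormat path listing would let the JPEG/PDF derivatives be regenerated wholesale whenever needed.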


Re: Suitability of HDFS for live file store

Posted by Matt Painter <ma...@deity.co.nz>.
Sorry, I should have provided a bit more detail. Currently our data set
consists of 50-100Mb TIFF files. In the near future we'd like to store and
process preservation-quality digitised film, which will individually exceed
this size by orders of magnitude (and has currently been in the "too-hard"
basket with our current infrastructure). In general, our thinking thus far
has been very much based on what our current infrastructure can provide -
so I'm excited to have alternatives available.

There will also be thumbnail generation as well as generation of the
screen-resolution JPEGs that I alluded to, and PDF generation. Whether the
JPEG/PDF derivatives are stored in HDFS remains to be seen - these can be
easily regenerated at any stage and their total size will be relatively
small, so it may not be the best fit for storage of these guys.

M


On 16 October 2012 09:35, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hi,
>
> We use HDFS to process data for the LHC - somewhat similar case here.  Our
> files are a bit larger, our total local data size is ~1PB logical, and we
> "bring our own" batch system, so no Map-Reduce.  We perform many random
> reads, so we are quite sensitive to underlying latency.
>
> I don't see any obvious mismatches between your requirements and HDFS
> capabilities that you can eliminate it as a candidate without an
> evaluation.  Do note that HDFS does not provide complete POSIX semantics -
> but you don't appear to need them?
>
> IMHO, if you are looking for the following requirements:
> 1) Proven petascale data store (never want to be on the bleeding edge of
> your filesystem's scaling!).
> 2) Has self-healing semantics (can recover from the loss of RAIDs or
> entire storage targets).
> 3) Open source (but do consider commercial companies - your time is worth
> something!).
>
> You end up looking at a very small number of candidates.  Other
> filesystems that should be on your list:
>
> 1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial
> support.  I personally don't know enough to provide a pros/cons list, but
> we keep it on our radar.
> 2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.
>  Requires a quite recent kernel.  Quite good on-paper design.
> 3) Lustre.  I think you'd be disappointed with the self-healing.  A very
> "traditional" HPC/clustered filesystem design.
>
> For us, HDFS wins.  I think it has the possibility of being a winner in
> your case too.
>
> Brian
>
> On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:
>
> Seems like a heavyweight solution unless you are actually processing the
> images?
>
> Wow, no mapreduce, no streaming writes, and relatively small files.  I'm
> surprised that you are considering hadoop at all?
> 
> I'm surprised there isn't a simpler solution that uses redundancy without
> all the
> daemons and name nodes and task trackers and stuff.
>
> Might make it kind of awkward as a normal file system.
>
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hey Matt,
>>
>> What do you mean by 'real-time' though? While HDFS has pretty good
>> contiguous data read speeds (and you get N x replicas to read from),
>> if you're looking to "cache" frequently accessed files into memory
>> then HDFS does not natively have support for that. Otherwise, I agree
>> with Brock, seems like you could make it work with HDFS (sans
>> MapReduce - no need to run it if you don't need it).
>>
>> The presence of NameNode audit logging will help your file access
>> analysis requirement.
>>
>> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
>> > Hi,
>> >
>> > I am a new Hadoop user, and would really appreciate your opinions on
>> whether
>> > Hadoop is the right tool for what I'm thinking of using it for.
>> >
>> > I am investigating options for scaling an archive of around 100Tb of
>> image
>> > data. These images are typically TIFF files of around 50-100Mb each and
>> need
>> > to be made available online in realtime. Access to the files will be
>> > sporadic and occasional, but writing the files will be a daily activity.
>> > Speed of write is not particularly important.
>> >
>> > Our previous solution was a monolithic, expensive - and very full - SAN
>> so I
>> > am excited by Hadoop's distributed, extensible, redundant architecture.
>> >
>> > My concern is that a lot of the discussion on and use cases for Hadoop
>> is
>> > regarding data processing with MapReduce and - from what I understand -
>> > using HDFS for the purpose of input for MapReduce jobs. My other
>> concern is
>> > vague indication that it's not a 'real-time' system. We may be using
>> > MapReduce in small components of the application, but it will most
>> likely be
>> > in file access analysis rather than any processing on the files
>> themselves.
>> >
>> > In other words, what I really want is a distributed, resilient, scalable
>> > filesystem.
>> >
>> > Is Hadoop suitable if we just use this facility, or would I be misusing
>> it
>> > and inviting grief?
>> >
>> > M
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Jay Vyas
> MMSB/UCHC
>
>
>


-- 
Matt Painter
matt@deity.co.nz
+64 21 115 9378

Re: Suitability of HDFS for live file store

Posted by Matt Painter <ma...@deity.co.nz>.
Sorry, I should have provided a bit more detail. Currently our data set
comprises of 50-100Mb TIFF files. In the near future we'd like to store and
process preservation-quality digitised film, which will individually exceed
this size by orders of magnitude (and has currently been in the "too-hard"
basket with our current infrastructure). In general, our thinking thus far
has been very much based on what our current infrastructure can provide -
so I'm excited to have alternatives available.

There will also be thumbnail generation as well as generation of the
screen-resolution JPEGs that I alluded to, and PDF generation. Whether the
JPEG/PDF derivatives are stored in HDFS remains to be seen - these can be
easily regenerated at any stage and their total size will be relatively
small, so it may not be the best fit for storage of these guys.

M


On 16 October 2012 09:35, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hi,
>
> We use HDFS to process data for the LHC - somewhat similar case here.  Our
> files are a bit larger, our total local data size if ~1PB logical, and we
> "bring our own" batch system, so no Map-Reduce.  We perform many random
> reads, so we are quite sensitive to underlying latency.
>
> I don't see any obvious mismatches between your requirements and HDFS
> capabilities that you can eliminate it as a candidate without an
> evaluation.  Do note that HDFS does not provide complete POSIX semantics -
> but you don't appear to need them?
>
> IMHO, if you are looking for the following requirements:
> 1) Proven petascale data store (never want to be on the bleeding edge of
> your filesystem's scaling!).
> 2) Has self-healing semantics (can recover from the loss of RAIDs or
> entire storage targets).
> 3) Open source (but do consider commercial companies - your time is worth
> something!).
>
> You end up at looking at a very small number of candidates.  Others
> filesystems that should be on your list:
>
> 1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial
> support.  I personally don't know enough to provide a pros/cons list, but
> we keep it on our radar.
> 2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.
>  Requires a quite recent kernel.  Quite good on-paper design.
> 3) Lustre.  I think you'd be disappointed with the self-healing.  A very
> "traditional" HPC/clustered filesystem design.
>
> For us, HDFS wins.  I think it has the possibility of being a winner in
> your case too.
>
> Brian
>
> On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:
>
> Seems like a heavyweight solution unless you are actually processing the
> images?
>
> Wow, no MapReduce, no streaming writes, and relatively small files.  I'm
> surprised that you are considering Hadoop at all?
>
> I'm surprised there isn't a simpler solution that uses redundancy without
> all the
> daemons and name nodes and task trackers and stuff.
>
> Might make it kind of awkward as a normal file system.
>
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hey Matt,
>>
>> What do you mean by 'real-time' though? While HDFS has pretty good
>> contiguous data read speeds (and you get N x replicas to read from),
>> if you're looking to "cache" frequently accessed files into memory
>> then HDFS does not natively have support for that. Otherwise, I agree
>> with Brock, seems like you could make it work with HDFS (sans
>> MapReduce - no need to run it if you don't need it).
>>
>> The presence of NameNode audit logging will help your file access
>> analysis requirement.
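
As an aside on that analysis point: the audit log is plain key=value text, so
even without MapReduce a few lines of Java will tally opens per file. A rough
sketch - the log filename is hypothetical and the exact field set varies a bit
between Hadoop versions:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.HashMap;
  import java.util.Map;

  public class AuditTally {
    public static void main(String[] args) throws Exception {
      Map<String, Integer> opensPerFile = new HashMap<String, Integer>();
      BufferedReader r = new BufferedReader(new FileReader("hdfs-audit.log"));
      for (String line; (line = r.readLine()) != null; ) {
        String cmd = null, src = null;
        for (String tok : line.split("\\s+")) {          // fields look like cmd=open, src=/path
          if (tok.startsWith("cmd=")) cmd = tok.substring(4);
          if (tok.startsWith("src=")) src = tok.substring(4);
        }
        if ("open".equals(cmd) && src != null) {
          Integer n = opensPerFile.get(src);
          opensPerFile.put(src, n == null ? 1 : n + 1);  // count reads per file
        }
      }
      r.close();
      System.out.println(opensPerFile);
    }
  }
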
>>
>> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
>> > Hi,
>> >
>> > I am a new Hadoop user, and would really appreciate your opinions on
>> whether
>> > Hadoop is the right tool for what I'm thinking of using it for.
>> >
>> > I am investigating options for scaling an archive of around 100Tb of
>> image
>> > data. These images are typically TIFF files of around 50-100Mb each and
>> need
>> > to be made available online in realtime. Access to the files will be
>> > sporadic and occasional, but writing the files will be a daily activity.
>> > Speed of write is not particularly important.
>> >
>> > Our previous solution was a monolithic, expensive - and very full - SAN
>> so I
>> > am excited by Hadoop's distributed, extensible, redundant architecture.
>> >
>> > My concern is that a lot of the discussion on and use cases for Hadoop
>> is
>> > regarding data processing with MapReduce and - from what I understand -
>> > using HDFS for the purpose of input for MapReduce jobs. My other
>> concern is
>> > vague indication that it's not a 'real-time' system. We may be using
>> > MapReduce in small components of the application, but it will most
>> likely be
>> > in file access analysis rather than any processing on the files
>> themselves.
>> >
>> > In other words, what I really want is a distributed, resilient, scalable
>> > filesystem.
>> >
>> > Is Hadoop suitable if we just use this facility, or would I be misusing
>> it
>> > and inviting grief?
>> >
>> > M
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Jay Vyas
> MMSB/UCHC
>
>
>


-- 
Matt Painter
matt@deity.co.nz
+64 21 115 9378

Re: Suitability of HDFS for live file store

Posted by Ted Dunning <td...@maprtech.com>.
If you are going to mention commercial distros, you should include MapR as
well.  Hadoop compatible, very scalable and handles very large numbers of
files in a Posix-ish environment.

On Mon, Oct 15, 2012 at 1:35 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:

> Hi,
>
> We use HDFS to process data for the LHC - somewhat similar case here.  Our
> files are a bit larger, our total local data size is ~1PB logical, and we
> "bring our own" batch system, so no Map-Reduce.  We perform many random
> reads, so we are quite sensitive to underlying latency.
>
> I don't see any obvious mismatches between your requirements and HDFS
> capabilities that you can eliminate it as a candidate without an
> evaluation.  Do note that HDFS does not provide complete POSIX semantics -
> but you don't appear to need them?
>
> IMHO, if you are looking for the following requirements:
> 1) Proven petascale data store (never want to be on the bleeding edge of
> your filesystem's scaling!).
> 2) Has self-healing semantics (can recover from the loss of RAIDs or
> entire storage targets).
> 3) Open source (but do consider commercial companies - your time is worth
> something!).
>
> You end up looking at a very small number of candidates.  Other
> filesystems that should be on your list:
>
> 1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial
> support.  I personally don't know enough to provide a pros/cons list, but
> we keep it on our radar.
> 2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.
>  Requires a quite recent kernel.  Quite good on-paper design.
> 3) Lustre.  I think you'd be disappointed with the self-healing.  A very
> "traditional" HPC/clustered filesystem design.
>
> For us, HDFS wins.  I think it has the possibility of being a winner in
> your case too.
>
> Brian
>
> On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:
>
> Seems like a heavyweight solution unless you are actually processing the
> images?
>
> Wow, no MapReduce, no streaming writes, and relatively small files.  I'm
> surprised that you are considering Hadoop at all?
>
> I'm surprised there isn't a simpler solution that uses redundancy without
> all the
> daemons and name nodes and task trackers and stuff.
>
> Might make it kind of awkward as a normal file system.
>
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hey Matt,
>>
>> What do you mean by 'real-time' though? While HDFS has pretty good
>> contiguous data read speeds (and you get N x replicas to read from),
>> if you're looking to "cache" frequently accessed files into memory
>> then HDFS does not natively have support for that. Otherwise, I agree
>> with Brock, seems like you could make it work with HDFS (sans
>> MapReduce - no need to run it if you don't need it).
>>
>> The presence of NameNode audit logging will help your file access
>> analysis requirement.
>>
>>
>> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
>> > Hi,
>> >
>> > I am a new Hadoop user, and would really appreciate your opinions on
>> whether
>> > Hadoop is the right tool for what I'm thinking of using it for.
>> >
>> > I am investigating options for scaling an archive of around 100Tb of
>> image
>> > data. These images are typically TIFF files of around 50-100Mb each and
>> need
>> > to be made available online in realtime. Access to the files will be
>> > sporadic and occasional, but writing the files will be a daily activity.
>> > Speed of write is not particularly important.
>> >
>> > Our previous solution was a monolithic, expensive - and very full - SAN
>> so I
>> > am excited by Hadoop's distributed, extensible, redundant architecture.
>> >
>> > My concern is that a lot of the discussion on and use cases for Hadoop
>> is
>> > regarding data processing with MapReduce and - from what I understand -
>> > using HDFS for the purpose of input for MapReduce jobs. My other
>> concern is
>> > vague indication that it's not a 'real-time' system. We may be using
>> > MapReduce in small components of the application, but it will most
>> likely be
>> > in file access analysis rather than any processing on the files
>> themselves.
>> >
>> > In other words, what I really want is a distributed, resilient, scalable
>> > filesystem.
>> >
>> > Is Hadoop suitable if we just use this facility, or would I be misusing
>> it
>> > and inviting grief?
>> >
>> > M
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Jay Vyas
> MMSB/UCHC
>
>
>

Re: Hadoop and CUDA

Posted by sudha sadhasivam <su...@yahoo.com>.
Hello

When we create a jar file for our Hadoop programs from the command prompt, it runs faster. When we create the jar file from NetBeans, it runs slower. We could not understand the problem.

This is important as we are trying to work with Hadoop and CUDA (jcuda). We could only create a jar file using the NetBeans IDE. The performance seemed to be very poor. In particular, it seems to read the data part by part, a few KB at a time.

Kindly let us know about the problem
Thanking you
G Sudha
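
It is hard to say much from that description alone (the two builds may simply
bundle different things into the jar), but the "reads a few KB at a time"
symptom is sometimes nothing more than a small copy buffer on the HDFS input
stream. A minimal sketch of reading with an explicit, larger buffer - the path
and the 1MB size are illustrative only:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BufferedHdfsRead {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Hypothetical input; open with a 1MB buffer instead of the 4KB default.
      FSDataInputStream in = fs.open(new Path("/data/input.bin"), 1024 * 1024);
      byte[] chunk = new byte[1024 * 1024];
      long total = 0;
      for (int n; (n = in.read(chunk)) > 0; ) {
        total += n;   // hand each chunk to the GPU kernel here, not byte by byte
      }
      in.close();
      System.out.println("read " + total + " bytes");
    }
  }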

Re: Suitability of HDFS for live file store

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi,

We use HDFS to process data for the LHC - somewhat similar case here.  Our files are a bit larger, our total local data size is ~1PB logical, and we "bring our own" batch system, so no Map-Reduce.  We perform many random reads, so we are quite sensitive to underlying latency.

I don't see any obvious mismatches between your requirements and HDFS capabilities that you can eliminate it as a candidate without an evaluation.  Do note that HDFS does not provide complete POSIX semantics - but you don't appear to need them?

IMHO, if you are looking for the following requirements:
1) Proven petascale data store (never want to be on the bleeding edge of your filesystem's scaling!).
2) Has self-healing semantics (can recover from the loss of RAIDs or entire storage targets).
3) Open source (but do consider commercial companies - your time is worth something!).

You end up looking at a very small number of candidates.  Other filesystems that should be on your list:

1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial support.  I personally don't know enough to provide a pros/cons list, but we keep it on our radar.
2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.  Requires a quite recent kernel.  Quite good on-paper design.
3) Lustre.  I think you'd be disappointed with the self-healing.  A very "traditional" HPC/clustered filesystem design.

For us, HDFS wins.  I think it has the possibility of being a winner in your case too.

Brian

On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:

> Seems like a heavyweight solution unless you are actually processing the images? 
> 
> Wow, no MapReduce, no streaming writes, and relatively small files.  I'm surprised that you are considering Hadoop at all?
> 
> I'm surprised there isn't a simpler solution that uses redundancy without all the 
> daemons and name nodes and task trackers and stuff.
> 
> Might make it kind of awkward as a normal file system. 
> 
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Matt,
> 
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
> 
> The presence of NameNode audit logging will help your file access
> analysis requirement.
> 
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of image
> > data. These images are typically TIFF files of around 50-100Mb each and need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most likely be
> > in file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing it
> > and inviting grief?
> >
> > M
> 
> 
> 
> --
> Harsh J
> 
> 
> 
> -- 
> Jay Vyas
> MMSB/UCHC


Re: Suitability of HDFS for live file store

Posted by "Goldstone, Robin J." <go...@llnl.gov>.
If the goal is simply an alternative to SAN for cost-effective storage of large files, you might want to take a look at Gluster.  It is an open source scale-out distributed filesystem that can utilize local storage. Also, it has distributed metadata and a POSIX interface and can be accessed through a number of clients, including FUSE, NFS and CIFS.  Supposedly you can even run Hadoop on top of Gluster.

I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list.  Note I have no vested interest in this particular solution, although I am in the process of evaluating it myself.

From: Jay Vyas <ja...@gmail.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Monday, October 15, 2012 1:21 PM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images?

Wow, no MapReduce, no streaming writes, and relatively small files.  I'm surprised that you are considering Hadoop at all?

I'm surprised there isn't a simpler solution that uses redundancy without all the
daemons and name nodes and task trackers and stuff.

Might make it kind of awkward as a normal file system.

On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com>> wrote:
Hey Matt,

What do you mean by 'real-time' though? While HDFS has pretty good
contiguous data read speeds (and you get N x replicas to read from),
if you're looking to "cache" frequently accessed files into memory
then HDFS does not natively have support for that. Otherwise, I agree
with Brock, seems like you could make it work with HDFS (sans
MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access
analysis requirement.

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz>> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100Tb of image
> data. These images are typically TIFF files of around 50-100Mb each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS for the purpose of input for MapReduce jobs. My other concern is
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M



--
Harsh J



--
Jay Vyas
MMSB/UCHC

Re: Suitability of HDFS for live file store

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi,

We use HDFS to process data for the LHC - somewhat similar case here.  Our files are a bit larger, our total local data size if ~1PB logical, and we "bring our own" batch system, so no Map-Reduce.  We perform many random reads, so we are quite sensitive to underlying latency.

I don't see any obvious mismatches between your requirements and HDFS capabilities that you can eliminate it as a candidate without an evaluation.  Do note that HDFS does not provide complete POSIX semantics - but you don't appear to need them?

IMHO, if you are looking for the following requirements:
1) Proven petascale data store (never want to be on the bleeding edge of your filesystem's scaling!).
2) Has self-healing semantics (can recover from the loss of RAIDs or entire storage targets).
3) Open source (but do consider commercial companies - your time is worth something!).

You end up at looking at a very small number of candidates.  Others filesystems that should be on your list:

1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial support.  I personally don't know enough to provide a pros/cons list, but we keep it on our radar.
2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.  Requires a quite recent kernel.  Quite good on-paper design.
3) Lustre.  I think you'd be disappointed with the self-healing.  A very "traditional" HPC/clustered filesystem design.

For us, HDFS wins.  I think it has the possibility of being a winner in your case too.

Brian

On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:

> Seems like a heavyweight solution unless you are actually processing the images? 
> 
> Wow, no mapreduce, no streaming writes, and relatively small files.  Im surprised that you are considering hadoop at all ?
> 
> Im surprised there isnt a simpler solution that uses redundancy without all the 
> daemons and name nodes and task trackers and stuff.
> 
> Might make it kind of awkward as a normal file system. 
> 
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Matt,
> 
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
> 
> The presence of NameNode audit logging will help your file access
> analysis requirement.
> 
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of image
> > data. These images are typically TIFF files of around 50-100Mb each and need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most likely be
> > in file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing it
> > and inviting grief?
> >
> > M
> 
> 
> 
> --
> Harsh J
> 
> 
> 
> -- 
> Jay Vyas
> MMSB/UCHC


Re: Suitability of HDFS for live file store

Posted by "Goldstone, Robin J." <go...@llnl.gov>.
If the goal is simply an alternative to SAN for cost-effective storage of large files you might want to take a look at Gluster.  It is an open source scale-out distributed filesystem that can utilize local storage. Also, it has distributed metadata and a POSIX interface and can be accessed through a number of clients, including fuse, NFS and CIFS.  Supposedly you can even run Hadoop on top of Gluster.

I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list.  Note I have no vested interest in this particular solution, although I am in the process of evaluating it myself.

From: Jay Vyas <ja...@gmail.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Monday, October 15, 2012 1:21 PM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images?

Wow, no mapreduce, no streaming writes, and relatively small files.  Im surprised that you are considering hadoop at all ?

Im surprised there isnt a simpler solution that uses redundancy without all the
daemons and name nodes and task trackers and stuff.

Might make it kind of awkward as a normal file system.

On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com>> wrote:
Hey Matt,

What do you mean by 'real-time' though? While HDFS has pretty good
contiguous data read speeds (and you get N x replicas to read from),
if you're looking to "cache" frequently accessed files into memory
then HDFS does not natively have support for that. Otherwise, I agree
with Brock, seems like you could make it work with HDFS (sans
MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access
analysis requirement.

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz>> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100Tb of image
> data. These images are typically TIFF files of around 50-100Mb each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS for the purpose of input for MapReduce jobs. My other concern is
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M



--
Harsh J



--
Jay Vyas
MMSB/UCHC

Re: Suitability of HDFS for live file store

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi,

We use HDFS to process data for the LHC - somewhat similar case here.  Our files are a bit larger, our total local data size if ~1PB logical, and we "bring our own" batch system, so no Map-Reduce.  We perform many random reads, so we are quite sensitive to underlying latency.

I don't see any obvious mismatches between your requirements and HDFS capabilities that you can eliminate it as a candidate without an evaluation.  Do note that HDFS does not provide complete POSIX semantics - but you don't appear to need them?

IMHO, if you are looking for the following requirements:
1) Proven petascale data store (never want to be on the bleeding edge of your filesystem's scaling!).
2) Has self-healing semantics (can recover from the loss of RAIDs or entire storage targets).
3) Open source (but do consider commercial companies - your time is worth something!).

You end up at looking at a very small number of candidates.  Others filesystems that should be on your list:

1) Gluster.  A quite viable alternate.  Like HDFS, you can buy commercial support.  I personally don't know enough to provide a pros/cons list, but we keep it on our radar.
2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deploys.  Requires a quite recent kernel.  Quite good on-paper design.
3) Lustre.  I think you'd be disappointed with the self-healing.  A very "traditional" HPC/clustered filesystem design.

For us, HDFS wins.  I think it has the possibility of being a winner in your case too.

Brian

On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:

> Seems like a heavyweight solution unless you are actually processing the images? 
> 
> Wow, no mapreduce, no streaming writes, and relatively small files.  Im surprised that you are considering hadoop at all ?
> 
> Im surprised there isnt a simpler solution that uses redundancy without all the 
> daemons and name nodes and task trackers and stuff.
> 
> Might make it kind of awkward as a normal file system. 
> 
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Matt,
> 
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
> 
> The presence of NameNode audit logging will help your file access
> analysis requirement.
> 
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of image
> > data. These images are typically TIFF files of around 50-100Mb each and need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most likely be
> > in file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing it
> > and inviting grief?
> >
> > M
> 
> 
> 
> --
> Harsh J
> 
> 
> 
> -- 
> Jay Vyas
> MMSB/UCHC


Re: Suitability of HDFS for live file store

Posted by "Goldstone, Robin J." <go...@llnl.gov>.
If the goal is simply an alternative to SAN for cost-effective storage of large files you might want to take a look at Gluster.  It is an open source scale-out distributed filesystem that can utilize local storage. Also, it has distributed metadata and a POSIX interface and can be accessed through a number of clients, including fuse, NFS and CIFS.  Supposedly you can even run Hadoop on top of Gluster.

I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list.  Note I have no vested interest in this particular solution, although I am in the process of evaluating it myself.

From: Jay Vyas <ja...@gmail.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Monday, October 15, 2012 1:21 PM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images?

Wow, no mapreduce, no streaming writes, and relatively small files.  Im surprised that you are considering hadoop at all ?

Im surprised there isnt a simpler solution that uses redundancy without all the
daemons and name nodes and task trackers and stuff.

Might make it kind of awkward as a normal file system.

On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com>> wrote:
Hey Matt,

What do you mean by 'real-time' though? While HDFS has pretty good
contiguous data read speeds (and you get N x replicas to read from),
if you're looking to "cache" frequently accessed files into memory
then HDFS does not natively have support for that. Otherwise, I agree
with Brock, seems like you could make it work with HDFS (sans
MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access
analysis requirement.

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz>> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100Tb of image
> data. These images are typically TIFF files of around 50-100Mb each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS for the purpose of input for MapReduce jobs. My other concern is
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M



--
Harsh J



--
Jay Vyas
MMSB/UCHC

Re: Suitability of HDFS for live file store

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi,

We use HDFS to process data for the LHC - somewhat similar case here.  Our files are a bit larger, our total local data size if ~1PB logical, and we "bring our own" batch system, so no Map-Reduce.  We perform many random reads, so we are quite sensitive to underlying latency.

I don't see any obvious mismatches between your requirements and HDFS capabilities that you can eliminate it as a candidate without an evaluation.  Do note that HDFS does not provide complete POSIX semantics - but you don't appear to need them?

IMHO, if you are looking for the following requirements:
1) Proven petascale data store (never want to be on the bleeding edge of your filesystem's scaling!).
2) Has self-healing semantics (can recover from the loss of RAIDs or entire storage targets).
3) Open source (but do consider commercial companies - your time is worth something!).

You end up looking at a very small number of candidates.  Other filesystems that should be on your list:

1) Gluster.  A quite viable alternative.  Like HDFS, you can buy commercial support.  I personally don't know enough to provide a pros/cons list, but we keep it on our radar.
2) Ceph.  Not as proven, IMHO - I don't know of multiple petascale deployments.  Requires a quite recent kernel.  Quite good on-paper design.
3) Lustre.  I think you'd be disappointed with the self-healing.  A very "traditional" HPC/clustered filesystem design.

For us, HDFS wins.  I think it has the possibility of being a winner in your case too.

Brian

On Oct 15, 2012, at 3:21 PM, Jay Vyas <ja...@gmail.com> wrote:

> Seems like a heavyweight solution unless you are actually processing the images? 
> 
> Wow, no mapreduce, no streaming writes, and relatively small files.  Im surprised that you are considering hadoop at all ?
> 
> Im surprised there isnt a simpler solution that uses redundancy without all the 
> daemons and name nodes and task trackers and stuff.
> 
> Might make it kind of awkward as a normal file system. 
> 
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Matt,
> 
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
> 
> The presence of NameNode audit logging will help your file access
> analysis requirement.
> 
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of image
> > data. These images are typically TIFF files of around 50-100Mb each and need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most likely be
> > in file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing it
> > and inviting grief?
> >
> > M
> 
> 
> 
> --
> Harsh J
> 
> 
> 
> -- 
> Jay Vyas
> MMSB/UCHC


Re: Suitability of HDFS for live file store

Posted by Jay Vyas <ja...@gmail.com>.
Seems like a heavyweight solution unless you are actually processing the
images?

Wow, no MapReduce, no streaming writes, and relatively small files.  I'm
surprised that you are considering Hadoop at all?

I'm surprised there isn't a simpler solution that uses redundancy without
all the daemons and NameNodes and TaskTrackers and stuff.

That might make it kind of awkward to use as a normal file system.

On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:

> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on
> whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of
> image
> > data. These images are typically TIFF files of around 50-100Mb each and
> need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN
> so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern
> is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most
> likely be
> > in file access analysis rather than any processing on the files
> themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing
> it
> > and inviting grief?
> >
> > M
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
MMSB/UCHC

Re: Suitability of HDFS for live file store

Posted by Brock Noland <br...@cloudera.com>.
Hi,

Harsh makes a good point: there is no explicit way to say "these files
should remain in memory". However, I would note that, given available
RAM on the datanodes, the operating system will cache recently
accessed blocks.

Brock

On Mon, Oct 15, 2012 at 3:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
>> Hi,
>>
>> I am a new Hadoop user, and would really appreciate your opinions on whether
>> Hadoop is the right tool for what I'm thinking of using it for.
>>
>> I am investigating options for scaling an archive of around 100Tb of image
>> data. These images are typically TIFF files of around 50-100Mb each and need
>> to be made available online in realtime. Access to the files will be
>> sporadic and occasional, but writing the files will be a daily activity.
>> Speed of write is not particularly important.
>>
>> Our previous solution was a monolithic, expensive - and very full - SAN so I
>> am excited by Hadoop's distributed, extensible, redundant architecture.
>>
>> My concern is that a lot of the discussion on and use cases for Hadoop is
>> regarding data processing with MapReduce and - from what I understand -
>> using HDFS for the purpose of input for MapReduce jobs. My other concern is
>> vague indication that it's not a 'real-time' system. We may be using
>> MapReduce in small components of the application, but it will most likely be
>> in file access analysis rather than any processing on the files themselves.
>>
>> In other words, what I really want is a distributed, resilient, scalable
>> filesystem.
>>
>> Is Hadoop suitable if we just use this facility, or would I be misusing it
>> and inviting grief?
>>
>> M
>
>
>
> --
> Harsh J



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: Suitability of HDFS for live file store

Posted by Matt Painter <ma...@deity.co.nz>.
Thanks guys; really appreciated.

I was deliberately vague about the notion of real-time because I didn't
know which metrics make Hadoop considered a batch system - if that makes
sense!

Essentially, the speed of access to the files stored in HDFS needs to be
comparable to reading files off a native file system for end-user
download. While the bulk of the data on disk will be TIFF files, we will
also be including JPEG derivatives which we intend to display inline in a
web-based application.

We typically have sparse access patterns - we have millions of files, but
each file may be viewed at most once over a year. Therefore, native
in-memory caching isn't so much of an issue.
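
A minimal sketch of serving one of those derivatives straight out of HDFS
via the Java FileSystem API might look like the following (the class name,
request parameter and paths are purely illustrative, and error handling is
omitted):

    // Minimal sketch: stream a JPEG derivative from HDFS to the browser.
    // Servlet mapping, request parameter, and paths are illustrative;
    // error handling is omitted.
    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class DerivativeServlet extends HttpServlet {
      // Picks up core-site.xml/hdfs-site.xml from the servlet's classpath.
      private final Configuration conf = new Configuration();

      @Override
      protected void doGet(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        // e.g. GET /derivative?path=/archive/jpeg/2012/10/image-0001.jpg
        Path src = new Path(req.getParameter("path"));
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(src);
        resp.setContentType("image/jpeg");
        // Stream the file through in 4 KB chunks, then close both streams.
        IOUtils.copyBytes(in, resp.getOutputStream(), 4096, true);
      }
    }

Each request reads the file start-to-finish, so this leans on HDFS's
contiguous read speeds rather than on any caching layer.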

M

On 16 October 2012 09:08, Harsh J <ha...@cloudera.com> wrote:

> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on
> whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of
> image
> > data. These images are typically TIFF files of around 50-100Mb each and
> need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN
> so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern
> is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most
> likely be
> > in file access analysis rather than any processing on the files
> themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing
> it
> > and inviting grief?
> >
> > M
>
>
>
> --
> Harsh J
>



-- 
Matt Painter
matt@deity.co.nz
+64 21 115 9378

Re: Suitability of HDFS for live file store

Posted by Harsh J <ha...@cloudera.com>.
Hey Matt,

What do you mean by 'real-time' though? While HDFS has pretty good
contiguous data read speeds (and you get N x replicas to read from),
if you're looking to "cache" frequently accessed files into memory
then HDFS does not natively have support for that. Otherwise, I agree
with Brock, seems like you could make it work with HDFS (sans
MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access
analysis requirement.
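
Enabling it usually just means routing the NameNode's audit logger to its
own appender in log4j.properties. A sketch, assuming the stock property
names (these may differ between Hadoop versions, so check the copy shipped
with your release):

    # Route NameNode audit events to their own rolling log file
    # (illustrative; property names follow the stock log4j.properties).
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,RFAAUDIT
    log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
    log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
    log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
    log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n

Each namespace operation then shows up as a single line recording the user,
client IP, command and path - the raw material for that kind of access
analysis.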

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <ma...@deity.co.nz> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100Tb of image
> data. These images are typically TIFF files of around 50-100Mb each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS for the purpose of input for MapReduce jobs. My other concern is
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M



-- 
Harsh J

Re: Suitability of HDFS for live file store

Posted by Brock Noland <br...@cloudera.com>.
Hi,

Generally I do not see a problem with your plan of using HDFS to store
these files, assuming they are updated rarely if ever. Hadoop is
traditionally a batch system and MapReduce largely remains a batch
system; I'd argue this because minimum job latencies are in the
"seconds" range. HDFS, however, has real-time systems built on top of
it, like HBase. The main issue to be concerned with when using HDFS
simply as storage is file size. As HDFS stores its metadata in RAM, you
don't want to create tremendous numbers of "small" files. With
50-100MB files you should be fine.
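
As a minimal sketch of what "HDFS simply as storage" looks like from the
Java FileSystem API, with no MapReduce anywhere (the cluster address, paths
and class name are illustrative, and fs.default.name is the 1.x-era key -
newer releases spell it fs.defaultFS):

    // Minimal sketch: copy one archived image into HDFS with the
    // FileSystem API. Cluster address and paths are illustrative.
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ArchiveWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020"); // fs.defaultFS on 2.x

        FileSystem fs = FileSystem.get(conf);
        InputStream in = new FileInputStream("/staging/image-0001.tif");
        Path dst = new Path("/archive/tiff/2012/10/image-0001.tif");

        // copyBytes with close=true closes both streams when the copy ends.
        IOUtils.copyBytes(in, fs.create(dst), conf, true);
      }
    }

The same copy can be done from the command line with hadoop fs -put; the
main thing to get right is the one-file-per-image layout, so the NameNode's
in-RAM metadata stays manageable.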

Cheers,
Brock

On Mon, Oct 15, 2012 at 2:47 PM, Matt Painter <ma...@deity.co.nz> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100Tb of image
> data. These images are typically TIFF files of around 50-100Mb each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS for the purpose of input for MapReduce jobs. My other concern is
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
