Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2014/05/15 22:45:31 UTC

I need advice on whether my starting data needs to be in HDFS

I have a medium-size data set in the terabytes range that currently lives
on the NFS file server of a medium-sized institution. Every few months we
want to run a chain of five Hadoop jobs on this data.
   The cluster is medium sized - 40 nodes, about 200 simultaneous jobs. The
book says copy the data to HDFS and run the job. If I consider the copy to
HDFS and the first mapper as a single task, I wonder whether it is not just
as easy to have a custom reader read the data from the NFS file system as a
local file and skip the step of copying it to HDFS.
   While the read into the mapper may be slower, dropping the copy to HDFS
could well make up the difference. Assume that after the job runs the data
will be deleted from HDFS - the NFS system is the primary source and that
cannot change. Also, the job is not I/O limited - there is significant
computation at each step.

    My questions are:
  1) are my assumptions correct, and could skipping the copy save time?
  2) would 200 Hadoop jobs overwhelm a medium-sized NFS system?
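
For concreteness, a rough sketch of the two options - the driver class
name, the paths, and the /mnt/nfs mount point below are made up for
illustration, and the file:// variant assumes every worker node mounts the
NFS export at the same path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Step1Driver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "step-1");
    job.setJarByClass(Step1Driver.class);
    // Mapper/reducer setup for the real computation omitted here.

    // Option 1: the data was staged into HDFS beforehand.
    FileInputFormat.addInputPath(job, new Path("/user/steve/input"));

    // Option 2: read straight off the NFS mount and skip the copy.
    // Requires /mnt/nfs/data to be mounted identically on every node.
    // FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/data"));

    FileOutputFormat.setOutputPath(job, new Path("/user/steve/step1-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}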

RE: I need advice on whether my starting data needs to be in HDFS

Posted by Christoph Schmitz <ch...@1und1.de>.
Hi Steve,

I'll second David's opinion (if I get it correctly ;-) that importing your data first and doing the interesting processing later would be a good idea.

Of course you can read your data from NFS during the actual processing, as long as all of your TaskTrackers have access to the NFS. Keep in mind, though, that importing your data in a separate step gives you a lot more possibilities, e.g. controlling the load on your NFS, doing some simple preprocessing such as splitting the data by date, and making sure your data is ready for further processing (e.g. transformed into a splittable format such as SequenceFiles). In my team, we usually end up writing custom import code (rather than using, say, distcp), but that pays off in the actual processing code.
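
A minimal sketch of what such a custom import step could look like -
reading one file off the locally mounted NFS export and writing it into a
splittable SequenceFile in HDFS; the paths, the /mnt/nfs mount point and
the <byte offset, line> key/value choice are made up for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class NfsImport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/user/steve/imported/part-00000");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
             SequenceFile.Writer.file(out),
             SequenceFile.Writer.keyClass(LongWritable.class),
             SequenceFile.Writer.valueClass(Text.class));
         BufferedReader in =
             new BufferedReader(new FileReader("/mnt/nfs/data/input.txt"))) {
      long offset = 0;
      String line;
      while ((line = in.readLine()) != null) {
        // One record per input line, keyed by its byte offset.
        writer.append(new LongWritable(offset), new Text(line));
        offset += line.length() + 1;
      }
    }
  }
}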

IMHO, with 200 tasks (I assume that's what you meant) running in parallel you'll probably saturate a "normal" NFS server, unless your computation is so expensive that I/O happens only sparsely in your CPU-bound tasks.

Regards,
Christoph

-----Original Message-----
From: David Rosenstrauch [mailto:darose@darose.net] 
Sent: Monday, May 19, 2014 16:33
To: mapreduce-user@hadoop.apache.org
Subject: Re: I need advice on whether my starting data needs to be in HDFS

That's pre-processing.  I.e., yes, when you push the file from the local
file system into HDFS before you run your job, the data is read from a
single machine.

However, presumably the processing your job does is the (far) more
lengthy activity, and so running the job can take much less time if the
data is in HDFS.

DR

On 05/19/2014 10:27 AM, Steve Lewis wrote:
> I understand - although I can write an InputFormat / splitter which
> starts with a file in NFS rather than HDFS.
> Also, when I count the import to HDFS as part of the processing, haven't
> I gone to a single machine to read all the data?
>
>
> On Mon, May 19, 2014 at 2:30 PM, David Rosenstrauch <da...@darose.net> wrote:
>
>> The reason why you want to copy to HDFS first is that HDFS splits the data
>> and distributes it across the nodes in the cluster.  So if your input data
>> is large, you'll get much better efficiency/speed in processing it if
>> you're processing it in a distributed manner.  (I.e., multiple machines
>> each processing a piece of it - multiple mappers.) I'd think that keeping
>> the data in NFS would be quite slow.
>>
>> HTH,
>>
>> DR
>>
>>
>> On 05/15/2014 04:45 PM, Steve Lewis wrote:
>>
>>> I have a medium-size data set in the terabytes range that currently
>>> lives on the NFS file server of a medium-sized institution. Every few
>>> months we want to run a chain of five Hadoop jobs on this data.
>>>      The cluster is medium sized - 40 nodes, about 200 simultaneous
>>> jobs. The book says copy the data to HDFS and run the job. If I
>>> consider the copy to HDFS and the first mapper as a single task, I
>>> wonder whether it is not just as easy to have a custom reader read the
>>> data from the NFS file system as a local file and skip the step of
>>> copying it to HDFS.
>>>      While the read into the mapper may be slower, dropping the copy to
>>> HDFS could well make up the difference. Assume that after the job runs
>>> the data will be deleted from HDFS - the NFS system is the primary
>>> source and that cannot change. Also, the job is not I/O limited - there
>>> is significant computation at each step.
>>>
>>>       My questions are:
>>>     1) are my assumptions correct, and could skipping the copy save time?
>>>     2) would 200 Hadoop jobs overwhelm a medium-sized NFS system?
>>>
>>>
>>
>
>


Re: I need advice on whether my starting data needs to be in HDFS

Posted by David Rosenstrauch <da...@darose.net>.
That's pre-processing.  I.e., yes, when you push the file from the local
file system into HDFS before you run your job, the data is read from a
single machine.

However, presumably the processing your job does is the (far) more
lengthy activity, and so running the job can take much less time if the
data is in HDFS.

DR
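
As a concrete (if simplified) sketch of that staging step - the NFS and
HDFS paths are made up for illustration, and the copy streams through
whichever single machine runs it, which is exactly the point above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageToHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());
    // Reads the file over the NFS mount on this one machine and writes it
    // into HDFS, where it is split into blocks spread across the cluster.
    hdfs.copyFromLocalFile(false /* keep the source */,
        new Path("file:///mnt/nfs/data/input.dat"),
        new Path("/user/steve/input/input.dat"));
  }
}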

On 05/19/2014 10:27 AM, Steve Lewis wrote:
> I understand - although I can write an InputFormat / splitter which
> starts with a file in NFS rather than HDFS.
> Also, when I count the import to HDFS as part of the processing, haven't
> I gone to a single machine to read all the data?
>
>
> On Mon, May 19, 2014 at 2:30 PM, David Rosenstrauch <da...@darose.net> wrote:
>
>> The reason why you want to copy to HDFS first is that HDFS splits the data
>> and distributes it across the nodes in the cluster.  So if your input data
>> is large, you'll get much better efficiency/speed in processing it if
>> you're processing it in a distributed manner.  (I.e., multiple machines
>> each processing a piece of it - multiple mappers.) I'd think that keeping
>> the data in NFS would be quite slow.
>>
>> HTH,
>>
>> DR
>>
>>
>> On 05/15/2014 04:45 PM, Steve Lewis wrote:
>>
>>> I have a medium-size data set in the terabytes range that currently
>>> lives on the NFS file server of a medium-sized institution. Every few
>>> months we want to run a chain of five Hadoop jobs on this data.
>>>      The cluster is medium sized - 40 nodes, about 200 simultaneous
>>> jobs. The book says copy the data to HDFS and run the job. If I
>>> consider the copy to HDFS and the first mapper as a single task, I
>>> wonder whether it is not just as easy to have a custom reader read the
>>> data from the NFS file system as a local file and skip the step of
>>> copying it to HDFS.
>>>      While the read into the mapper may be slower, dropping the copy to
>>> HDFS could well make up the difference. Assume that after the job runs
>>> the data will be deleted from HDFS - the NFS system is the primary
>>> source and that cannot change. Also, the job is not I/O limited - there
>>> is significant computation at each step.
>>>
>>>       My questions are:
>>>     1) are my assumptions correct, and could skipping the copy save time?
>>>     2) would 200 Hadoop jobs overwhelm a medium-sized NFS system?
>>>
>>>
>>
>
>


Re: I need advice on whether my starting data needs to be in HDFS

Posted by Steve Lewis <lo...@gmail.com>.
I understand - although I can write an InputFormat / splitter which
starts with a file in NFS rather than HDFS.
Also, when I count the import to HDFS as part of the processing, haven't I
gone to a single machine to read all the data?
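
A rough sketch of that kind of reader: the standard TextInputFormat will
already compute byte-range splits over a file:// path, so the custom part
may be little more than job configuration. The class name, paths and the
128 MB split-size cap below are illustrative, and every node must see the
same /mnt/nfs mount:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NfsInputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-from-nfs");
    job.setJarByClass(NfsInputDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    // Each map task reads its own byte range of the file over the NFS mount.
    FileInputFormat.addInputPath(job,
        new Path("file:///mnt/nfs/data/input.txt"));
    // Cap the split size so the file is carved into many map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    FileOutputFormat.setOutputPath(job, new Path("/user/steve/nfs-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}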


On Mon, May 19, 2014 at 2:30 PM, David Rosenstrauch <da...@darose.net> wrote:

> The reason why you want to copy to HDFS first is that HDFS splits the data
> and distributes it across the nodes in the cluster.  So if your input data
> is large, you'll get much better efficiency/speed in processing it if
> you're processing it in a distributed manner.  (I.e., multiple machines
> each processing a piece of it - multiple mappers.) I'd think that keeping
> the data in NFS would be quite slow.
>
> HTH,
>
> DR
>
>
> On 05/15/2014 04:45 PM, Steve Lewis wrote:
>
>> I have a medium-size data set in the terabytes range that currently
>> lives on the NFS file server of a medium-sized institution. Every few
>> months we want to run a chain of five Hadoop jobs on this data.
>>     The cluster is medium sized - 40 nodes, about 200 simultaneous
>> jobs. The book says copy the data to HDFS and run the job. If I
>> consider the copy to HDFS and the first mapper as a single task, I
>> wonder whether it is not just as easy to have a custom reader read the
>> data from the NFS file system as a local file and skip the step of
>> copying it to HDFS.
>>     While the read into the mapper may be slower, dropping the copy to
>> HDFS could well make up the difference. Assume that after the job runs
>> the data will be deleted from HDFS - the NFS system is the primary
>> source and that cannot change. Also, the job is not I/O limited - there
>> is significant computation at each step.
>>
>>      My questions are:
>>    1) are my assumptions correct, and could skipping the copy save time?
>>    2) would 200 Hadoop jobs overwhelm a medium-sized NFS system?
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: I need advice on whether my starting data needs to be in HDFS

Posted by David Rosenstrauch <da...@darose.net>.
The reason why you want to copy to HDFS first is that HDFS splits the
data and distributes it across the nodes in the cluster.  So if your
input data is large, you'll get much better efficiency/speed in
processing it if you're processing it in a distributed manner.  (I.e.,
multiple machines each processing a piece of it - multiple mappers.)
I'd think that keeping the data in NFS would be quite slow.

HTH,

DR
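
A small sketch of what that distribution looks like once a file is in
HDFS - it just prints which hosts hold each block, and each block
typically becomes one input split, i.e. one mapper scheduled near one of
those hosts; the path is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());
    FileStatus status =
        hdfs.getFileStatus(new Path("/user/steve/input/input.dat"));
    // One BlockLocation per HDFS block of the file.
    for (BlockLocation block :
        hdfs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(block.getOffset() + "-"
          + (block.getOffset() + block.getLength()) + " on "
          + String.join(",", block.getHosts()));
    }
  }
}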

On 05/15/2014 04:45 PM, Steve Lewis wrote:
> I have a medium-size data set in the terabytes range that currently
> lives on the NFS file server of a medium-sized institution. Every few
> months we want to run a chain of five Hadoop jobs on this data.
>     The cluster is medium sized - 40 nodes, about 200 simultaneous
> jobs. The book says copy the data to HDFS and run the job. If I
> consider the copy to HDFS and the first mapper as a single task, I
> wonder whether it is not just as easy to have a custom reader read the
> data from the NFS file system as a local file and skip the step of
> copying it to HDFS.
>     While the read into the mapper may be slower, dropping the copy to
> HDFS could well make up the difference. Assume that after the job runs
> the data will be deleted from HDFS - the NFS system is the primary
> source and that cannot change. Also, the job is not I/O limited - there
> is significant computation at each step.
>
>      My questions are:
>    1) are my assumptions correct, and could skipping the copy save time?
>    2) would 200 Hadoop jobs overwhelm a medium-sized NFS system?
>


Re: I need advice on whether my starting data needs to be in HDFS

Posted by David Rosenstrauch <da...@darose.net>.
On 05/15/2014 04:45 PM, Steve Lewis wrote:
> I have a medium-size data set in the terabytes range that currently
> lives on the NFS file server of a medium-sized institution. Every few
> months we want to run a chain of five Hadoop jobs on this data.
>     The cluster is medium sized - 40 nodes, about 200 simultaneous
> jobs. The book says copy the data to HDFS and run the job. If I
> consider the copy to HDFS and the first mapper as a single task, I
> wonder whether it is not just as easy to have a custom reader read the
> data from the NFS file system as a local file and skip the step of
> copying it to HDFS.
>     While the read into the mapper may be slower, dropping the copy to
> HDFS could well make up the difference. Assume that after the job runs
> the data will be deleted from HDFS - the NFS system is the primary
> source and that cannot change. Also, the job is not I/O limited - there
> is significant computation at each step.
>
>      My questions are:
>    1) are my assumptions correct, and could skipping the copy save time?
>    2) would 200 Hadoop jobs overwhelm a medium-sized NFS system?
>