Posted to user@hadoop.apache.org by Trevor Harmon <tr...@vocaro.com> on 2014/11/09 18:16:20 UTC

Is it wrong to bypass HDFS?

Hi,

I’m trying to model an "embarrassingly parallel" problem as a map-reduce job. The amount of data is small -- about 100MB per job, and about 0.25MB per work item -- but the reduce phase is very CPU-intensive, requiring about 30 seconds to reduce each mapper's output to a single value. The goal is to speed up the computation by distributing the tasks across many machines.

I am not sure how the mappers would work in this scenario. My initial thought was that there would be one mapper per reducer, and each mapper would fetch its input directly from the source database, using an input key provided by Hadoop. (Remember it’s only about 0.25MB per work item.) It would then do some necessary fix-up and massaging of the data to prepare it for the reduction phase.
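To make that concrete, here is a very rough sketch of the kind of mapper I have in mind. Everything specific in it is invented for illustration -- the JDBC calls, the work_items table, the source.db.url property, and the assumption that the job's input is just a list of work-item ids, one per line:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Rough sketch only: each map() call gets one work-item id from Hadoop and pulls
// that item's ~0.25MB of data straight from the source database.
public class WorkItemMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Connection conn;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Connection details would come from the job configuration.
            conn = DriverManager.getConnection(context.getConfiguration().get("source.db.url"));
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text workItemId, Context context)
            throws IOException, InterruptedException {
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT payload FROM work_items WHERE id = ?")) {
            stmt.setString(1, workItemId.toString().trim());
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    // Fix up / massage the raw payload before handing it to the reducer.
                    String prepared = massage(rs.getString("payload"));
                    context.write(workItemId, new Text(prepared));
                }
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    private String massage(String raw) {
        return raw.trim();  // placeholder for the real fix-up and massaging step
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}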

However, none of the tutorials and example code I’ve seen do it this way. They always copy the data from the source database to HDFS first. For my use case, this seems wasteful. The data per task is very small and can fit entirely in the mapper’s and reducer’s main memory, so I don’t need “big data” redundant storage. Also, the data is read only once per task, so there’s nothing to be gained by the data locality optimizations of HDFS. Having to copy the data to an intermediate data store seems unnecessary and just adds overhead in this case.

Is it okay to bypass HDFS for certain types of problems, such as this one? Or is there some reason mappers should never perform external I/O? I am very new to Hadoop so I don’t have much experience to go on here. Thank you,

Trevor


Re: Is it wrong to bypass HDFS?

Posted by Trevor Harmon <tr...@vocaro.com>.
Ah, yes, I remember reading about custom InputFormats but did not realize they could bypass HDFS entirely. Sounds like a good solution; I will look into it. Thanks,

Trevor

> On Nov 9, 2014, at 12:48 PM, Steve Lewis <lo...@gmail.com> wrote:
> 
> You should consider writing a custom InputFormat that reads directly from the database. While FileInputFormat is the most common implementation of InputFormat, neither the InputFormat contract nor its critical getSplits method requires HDFS.
> A custom version can return database entries as splits and supply a custom RecordReader to return the values.
> For your problem, a RecordReader that reads only a single record from the database might be very reasonable.
> 
> On Sun, Nov 9, 2014 at 11:52 AM, Dieter De Witte <drdwitte@gmail.com> wrote:
> 100MB is very small, so the overhead of putting the data in HDFS is also very small. Does it even make sense to optimize this? (Reading/writing will only take a second or so.) If you don't want to stream data to HDFS and you have very little data, then you should look into alternative high-performance paradigms such as OpenMP or MPI, I think.
> 
> Regards, D
> 
> 2014-11-09 18:16 GMT+01:00 Trevor Harmon <trevor@vocaro.com>:
> Hi,
> 
> I’m trying to model an "embarrassingly parallel" problem as a map-reduce job. The amount of data is small -- about 100MB per job, and about 0.25MB per work item -- but the reduce phase is very CPU-intensive, requiring about 30 seconds to reduce each mapper's output to a single value. The goal is to speed up the computation by distributing the tasks across many machines.
> 
> I am not sure how the mappers would work in this scenario. My initial thought was that there would be one mapper per reducer, and each mapper would fetch its input directly from the source database, using an input key provided by Hadoop. (Remember it’s only about 0.25MB per work item.) It would then do some necessary fix-up and massaging of the data to prepare it for the reduction phase.
> 
> However, none of the tutorials and example code I’ve seen do it this way. They always copy the data from the source database to HDFS first. For my use case, this seems wasteful. The data per task is very small and can fit entirely in the mapper’s and reducer’s main memory, so I don’t need “big data” redundant storage. Also, the data is read only once per task, so there’s nothing to be gained by the data locality optimizations of HDFS. Having to copy the data to an intermediate data store seems unnecessary and just adds overhead in this case.
> 
> Is it okay to bypass HDFS for certain types of problems, such as this one? Or is there some reason mappers should never perform external I/O? I am very new to Hadoop so I don’t have much experience to go on here. Thank you,
> 
> Trevor
> 
> 
> 
> 
> 
> -- 
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
> 


Re: Is it wrong to bypass HDFS?

Posted by Steve Lewis <lo...@gmail.com>.
You should consider writing a custom InputFormat that reads directly from
the database. While FileInputFormat is the most common implementation of
InputFormat, neither the InputFormat contract nor its critical getSplits
method requires HDFS.
A custom version can return database entries as splits and supply a custom
RecordReader to return the values.
For your problem, a RecordReader that reads only a single record from the
database might be very reasonable.
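Just to sketch the idea (the work_items table, the payload column, and the
source.db.url property are invented here, and error handling is kept to a
bare minimum):

import java.io.*;
import java.sql.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

// Sketch of an InputFormat that never touches HDFS: getSplits() returns one
// split per work-item id, and the RecordReader pulls that single record from
// the database.
public class DatabaseInputFormat extends InputFormat<Text, Text> {

    // A split that just carries the id of one work item.
    public static class SingleRecordSplit extends InputSplit implements Writable {
        private String id;
        public SingleRecordSplit() {}                       // needed for deserialization
        public SingleRecordSplit(String id) { this.id = id; }
        public String getId() { return id; }
        @Override public long getLength() { return 1; }     // size is irrelevant here
        @Override public String[] getLocations() { return new String[0]; } // no locality
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(id); }
        @Override public void readFields(DataInput in) throws IOException { id = in.readUTF(); }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(
                     context.getConfiguration().get("source.db.url"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id FROM work_items")) {
            while (rs.next()) {
                splits.add(new SingleRecordSplit(rs.getString("id")));
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
        return splits;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<Text, Text>() {
            private Text key, value;
            private boolean done = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext context) throws IOException {
                String id = ((SingleRecordSplit) s).getId();
                try (Connection conn = DriverManager.getConnection(
                             context.getConfiguration().get("source.db.url"));
                     PreparedStatement stmt = conn.prepareStatement(
                             "SELECT payload FROM work_items WHERE id = ?")) {
                    stmt.setString(1, id);
                    try (ResultSet rs = stmt.executeQuery()) {
                        if (rs.next()) {
                            key = new Text(id);
                            value = new Text(rs.getString("payload"));
                        }
                    }
                } catch (SQLException e) {
                    throw new IOException(e);
                }
            }

            @Override public boolean nextKeyValue() {
                if (done || key == null) return false;
                done = true;                                // exactly one record per split
                return true;
            }
            @Override public Text getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() {}
        };
    }
}

You would then point the job at it with job.setInputFormatClass(DatabaseInputFormat.class)
and never stage anything in HDFS.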

On Sun, Nov 9, 2014 at 11:52 AM, Dieter De Witte <dr...@gmail.com> wrote:

> 100MB is very small, so the overhead of putting the data in HDFS is also
> very small. Does it even make sense to optimize this? (Reading/writing will
> only take a second or so.) If you don't want to stream data to HDFS and you
> have very little data, then you should look into alternative high-performance
> paradigms such as OpenMP or MPI, I think.
>
> Regards, D
>
> 2014-11-09 18:16 GMT+01:00 Trevor Harmon <tr...@vocaro.com>:
>
>> Hi,
>>
>> I’m trying to model an "embarrassingly parallel" problem as a map-reduce
>> job. The amount of data is small -- about 100MB per job, and about 0.25MB
>> per work item -- but the reduce phase is very CPU-intensive, requiring
>> about 30 seconds to reduce each mapper's output to a single value. The goal
>> is to speed up the computation by distributing the tasks across many
>> machines.
>>
>> I am not sure how the mappers would work in this scenario. My initial
>> thought was that there would be one mapper per reducer, and each mapper
>> would fetch its input directly from the source database, using an input key
>> provided by Hadoop. (Remember it’s only about 0.25MB per work item.) It
>> would then do some necessary fix-up and massaging of the data to prepare it
>> for the reduction phase.
>>
>> However, none of the tutorials and example code I’ve seen do it this way.
>> They always copy the data from the source database to HDFS first. For my
>> use case, this seems wasteful. The data per task is very small and can fit
>> entirely in the mapper’s and reducer’s main memory, so I don’t need “big
>> data” redundant storage. Also, the data is read only once per task, so
>> there’s nothing to be gained by the data locality optimizations of HDFS.
>> Having to copy the data to an intermediate data store seems unnecessary and
>> just adds overhead in this case.
>>
>> Is it okay to bypass HDFS for certain types of problems, such as this
>> one? Or is there some reason mappers should never perform external I/O? I
>> am very new to Hadoop so I don’t have much experience to go on here. Thank
>> you,
>>
>> Trevor
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Is it wrong to bypass HDFS?

Posted by Trevor Harmon <tr...@vocaro.com>.
You’re right, 100MB is small, but if there are 100,000 jobs, the overhead of copying data to HDFS adds up. I guess my main concern was whether allowing mappers to fetch the input data would violate some technical rule or map-reduce principle.

I have considered alternative solutions like OpenMP, but the Hadoop ecosystem seems richer and better supported among cloud providers such as Heroku and AWS.

Trevor

> On Nov 9, 2014, at 11:52 AM, Dieter De Witte <dr...@gmail.com> wrote:
> 
> 100MB is very small, so the overhead of putting the data in HDFS is also very small. Does it even make sense to optimize this? (Reading/writing will only take a second or so.) If you don't want to stream data to HDFS and you have very little data, then you should look into alternative high-performance paradigms such as OpenMP or MPI, I think.
> 
> Regards, D
> 
> 2014-11-09 18:16 GMT+01:00 Trevor Harmon <trevor@vocaro.com>:
> Hi,
> 
> I’m trying to model an "embarrassingly parallel" problem as a map-reduce job. The amount of data is small -- about 100MB per job, and about 0.25MB per work item -- but the reduce phase is very CPU-intensive, requiring about 30 seconds to reduce each mapper's output to a single value. The goal is to speed up the computation by distributing the tasks across many machines.
> 
> I am not sure how the mappers would work in this scenario. My initial thought was that there would be one mapper per reducer, and each mapper would fetch its input directly from the source database, using an input key provided by Hadoop. (Remember it’s only about 0.25MB per work item.) It would then do some necessary fix-up and massaging of the data to prepare it for the reduction phase.
> 
> However, none of the tutorials and example code I’ve seen do it this way. They always copy the data from the source database to HDFS first. For my use case, this seems wasteful. The data per task is very small and can fit entirely in the mapper’s and reducer’s main memory, so I don’t need “big data” redundant storage. Also, the data is read only once per task, so there’s nothing to be gained by the data locality optimizations of HDFS. Having to copy the data to an intermediate data store seems unnecessary and just adds overhead in this case.
> 
> Is it okay to bypass HDFS for certain types of problems, such as this one? Or is there some reason mappers should never perform external I/O? I am very new to Hadoop so I don’t have much experience to go on here. Thank you,
> 
> Trevor
> 
> 


Re: Is it wrong to bypass HDFS?

Posted by Dieter De Witte <dr...@gmail.com>.
100MB is very small, so the overhead of putting the data in HDFS is also
very small. Does it even make sense to optimize this? (Reading/writing will
only take a second or so.) If you don't want to stream data to HDFS and you
have very little data, then you should look into alternative high-performance
paradigms such as OpenMP or MPI, I think.
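To give an idea of how little work that is, staging the input is essentially
one call against the HDFS API (the paths below are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies the ~100MB of local input into HDFS before the job is submitted.
// Source and destination paths are placeholders.
public class StageInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/work-items.txt"),
                             new Path("/user/trevor/input/work-items.txt"));
        fs.close();
    }
}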

Regards, D

2014-11-09 18:16 GMT+01:00 Trevor Harmon <tr...@vocaro.com>:

> Hi,
>
> I’m trying to model an "embarrassingly parallel" problem as a map-reduce
> job. The amount of data is small -- about 100MB per job, and about 0.25MB
> per work item -- but the reduce phase is very CPU-intensive, requiring
> about 30 seconds to reduce each mapper's output to a single value. The goal
> is to speed up the computation by distributing the tasks across many
> machines.
>
> I am not sure how the mappers would work in this scenario. My initial
> thought was that there would be one mapper per reducer, and each mapper
> would fetch its input directly from the source database, using an input key
> provided by Hadoop. (Remember it’s only about 0.25MB per work item.) It
> would then do some necessary fix-up and massaging of the data to prepare it
> for the reduction phase.
>
> However, none of the tutorials and example code I’ve seen do it this way.
> They always copy the data from the source database to HDFS first. For my
> use case, this seems wasteful. The data per task is very small and can fit
> entirely in the mapper’s and reducer’s main memory, so I don’t need “big
> data” redundant storage. Also, the data is read only once per task, so
> there’s nothing to be gained by the data locality optimizations of HDFS.
> Having to copy the data to an intermediate data store seems unnecessary and
> just adds overhead in this case.
>
> Is it okay to bypass HDFS for certain types of problems, such as this one?
> Or is there some reason mappers should never perform external I/O? I am
> very new to Hadoop so I don’t have much experience to go on here. Thank you,
>
> Trevor
>
>
