Posted to hdfs-user@hadoop.apache.org by Public Network Services <pu...@gmail.com> on 2013/03/06 00:35:01 UTC

Execution handover in map/reduce pipeline

Hi...

I have an application that processes large amounts of proprietary
binary-encoded text data in the following sequence:

   1. Gets a URL to a file or a directory as input
   2. Reads the list of the binary files found under the input URL
   3. Extracts the text data from each of those files
   4. Saves the text data into new files
   5. Informs the application about newly extracted files
   6. Processes each of the extracted text files
   7. Submits the processing results to a proprietary data repository

This whole process is highly CPU-intensive and can be partially
parallelized, so I am thinking of trying Hadoop to achieve higher
performance.

So, assuming that all of the above takes place in HDFS (including the input
URL being an HDFS one), a MapReduce implementation could use:

   - A lightweight non-Hadoop thread to kick-start the execution flow, i.e.
   implement step 1
   - A Mapper that would implement steps 2-4
   - A Reducer that would implement step 5 (receive the notifications)
   - A Mapper that would implement step 6
   - A Reducer that would implement step 7

The first mapper (for steps 2-4) will probably need to do its processing in
a single, non-parallelized step.

My question is: how is the first reducer going to hand over execution to
the second mapper, once done?

Or, is there a better way of implementing the above scenario?

Thanks!

Re: Execution handover in map/reduce pipeline

Posted by Shumin Guo <gs...@gmail.com>.
Oozie can be a good choice for MapReduce job flow management, though it may
be too heavyweight for your problem.
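
If you do go the Oozie route, a minimal workflow sketch chaining two
map-reduce actions might look like the following. All names, paths, and
mapper/reducer classes here are hypothetical placeholders, the exact schema
version depends on your Oozie release, and note that the <map-reduce>
action drives the old mapred API unless you set the *.new-api flags:

<workflow-app name="extract-then-process" xmlns="uri:oozie:workflow:0.2">
    <start to="extract-text"/>
    <!-- Job 1: binary-to-text extraction (steps 2-4) -->
    <action name="extract-text">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${extractedDir}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.ExtractTextMapper</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="process-text"/>
        <error to="fail"/>
    </action>
    <!-- Job 2: process extracted text, submit results (steps 6-7) -->
    <action name="process-text">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${extractedDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${resultsDir}</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.ProcessTextMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.SubmitResultsReducer</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pipeline failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

The <ok to="process-text"/> transition is what sequences the two jobs.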

Based on your description, I am assuming that you are processing static
data files, i.e., the files will not change during processing and there are
no interdependencies among them.

The first job transforms the binary files into text files. You can use
FileInputFormat and FileOutputFormat.
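
One possible shape for that first job (my own sketch, not anything you have
shown): feed it a text file listing the binary-file paths one per line
(NLineInputFormat works for this), and have each map() open its file from
HDFS and write out the extracted text. ExtractTextMapper and extractText()
are hypothetical names; the decoder itself is the proprietary part.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExtractTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line holds the HDFS path of one binary file (step 2).
        Path binaryFile = new Path(line.toString().trim());
        FileSystem fs = binaryFile.getFileSystem(context.getConfiguration());

        // Extract the text and save it to a sibling file (steps 3-4).
        Path textFile = binaryFile.suffix(".txt");
        FSDataInputStream in = fs.open(binaryFile);
        FSDataOutputStream out = fs.create(textFile, true);
        try {
            out.write(extractText(in).getBytes("UTF-8"));
        } finally {
            in.close();
            out.close();
        }

        // Emit the new file's path so the reducer is "informed" (step 5).
        context.write(new Text(textFile.toString()), new Text("extracted"));
    }

    // Placeholder for the proprietary binary-to-text decoding.
    private String extractText(FSDataInputStream in) throws IOException {
        return "";
    }
}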

The second job should be started after the first job is done, and you can
still use FileInputFormat and choose a suitable output format, or write
your own. As for the handover: in plain MapReduce the first reducer does
not hand execution to the second mapper directly; a driver program runs the
two jobs in sequence, using the first job's output directory as the second
job's input.
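
A minimal driver sketch along those lines (ExtractTextMapper,
ProcessTextMapper, and SubmitResultsReducer are hypothetical placeholder
classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);      // URL from step 1
        Path extracted = new Path(args[1]);  // text files from steps 2-4
        Path results = new Path(args[2]);    // final output of steps 6-7

        // Job 1: extract text from the binary files.
        Job extract = new Job(conf, "extract-text");
        extract.setJarByClass(PipelineDriver.class);
        extract.setMapperClass(ExtractTextMapper.class);
        extract.setOutputKeyClass(Text.class);
        extract.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(extract, input);
        FileOutputFormat.setOutputPath(extract, extracted);

        // waitForCompletion(true) blocks until job 1 finishes; this is
        // the "handover" -- the driver, not the reducer, starts job 2.
        if (!extract.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: process the extracted text and submit the results.
        Job process = new Job(conf, "process-text");
        process.setJarByClass(PipelineDriver.class);
        process.setMapperClass(ProcessTextMapper.class);
        process.setReducerClass(SubmitResultsReducer.class);
        process.setOutputKeyClass(Text.class);
        process.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(process, extracted);
        FileOutputFormat.setOutputPath(process, results);

        System.exit(process.waitForCompletion(true) ? 0 : 1);
    }
}

If you prefer a declarative dependency graph inside a single driver,
JobControl (org.apache.hadoop.mapreduce.lib.jobcontrol) can express the
same job 1 -> job 2 ordering; Oozie generalizes it further.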

HTH.

Re: Execution handover in map/reduce pipeline

Posted by Michel Segel <mi...@hotmail.com>.
RTFM?

Yes you can do this.  See Oozie.

When you have a cryptic name, you get a cryptic answer.

Sent from a remote device. Please excuse any typos...

Mike Segel
