Posted to mapreduce-user@hadoop.apache.org by Fengyun RAO <ra...@gmail.com> on 2014/03/01 08:44:51 UTC

Map-Reduce: How to make MR output one file an hour?

This is a common web log analysis situation. The original web logs are saved
every hour on multiple servers.
Now we would like the parsed log results to be saved as one file per hour. How
can we do this?

In our MR job, the input is a directory with many files covering many hours,
let's say 4X files for X hours.
If there are, e.g., 10 reducers, then all of the results are partitioned
into 10 files, each of which contains results from every hour.
We would like the results to be saved in X files, each of which contains
only one hour of results.
Since the input files can change, I can't even set the number of reducers to
be exactly X in the program.

Re: Map-Reduce: How to make MR output one file an hour?

Posted by Devin Suiter RDX <ds...@rdx.com>.

If you only want one file, then you need to set the number of reducers to 1.

If the size of the data makes it impractical for the original MR job to use a
single reducer, you can run a second job on the output of the first, with the
default mapper and reducer, which are the identity ones, and set its
number of reducers to 1.

Or use the hadoop fs -getmerge command to collate the results into one file.
On Mar 1, 2014 4:59 AM, "Fengyun RAO" <ra...@gmail.com> wrote:

> Thanks, but how to set reducer number to X? X is dependent on input
> (run-time), which is unknown on job configuration (compile time).
>
>
> 2014-03-01 17:44 GMT+08:00 AnilKumar B <ak...@gmail.com>:
>
>> Hi,
>>
>> Write the custom partitioner on <timestamp> and as you mentioned set
>> #reducers to X.
>>
>>
>>
>

Re: Map-Reduce: How to make MR output one file an hour?

Posted by Fengyun RAO <ra...@gmail.com>.

Thanks, but how can I set the number of reducers to X? X depends on the input
(known only at run time) and is unknown when the job is configured (compile time).


2014-03-01 17:44 GMT+08:00 AnilKumar B <ak...@gmail.com>:

> Hi,
>
> Write the custom partitioner on <timestamp> and as you mentioned set
> #reducers to X.
>
>
>
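
One possible way around this, sketched under assumptions: the driver program itself runs at run time, so it could inspect the input file names and count the distinct hours before the job is submitted. The class InputHours, the method distinctHours, and the yyyyMMddHH file-name stamp are all hypothetical and not taken from this thread.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: derives the set of distinct hours from input file names.
final class InputHours {
    // Assumed naming scheme: a standalone 10-digit yyyyMMddHH stamp in each file name.
    private static final Pattern HOUR_STAMP =
            Pattern.compile("(?<!\\d)(\\d{10})(?!\\d)");

    // Returns the distinct hour stamps in sorted order; names without a stamp are skipped.
    static SortedSet<String> distinctHours(List<String> fileNames) {
        SortedSet<String> hours = new TreeSet<>();
        for (String name : fileNames) {
            Matcher m = HOUR_STAMP.matcher(name);
            if (m.find()) {
                hours.add(m.group(1));
            }
        }
        return hours;
    }
}
```

In a driver, the size of the returned set would be X, and it could be passed to Job.setNumReduceTasks before the job is submitted. How the file names are listed is left out of the sketch.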

Re: Map-Reduce: How to make MR output one file an hour?

Posted by AnilKumar B <ak...@gmail.com>.

Hi,

Write a custom partitioner on <timestamp> and, as you mentioned, set the
number of reducers to X.
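
A minimal sketch of the arithmetic such a partitioner might use, assuming each record carries its timestamp as epoch milliseconds. The class HourPartition and its methods are hypothetical; in a real job this logic would sit inside the getPartition(key, value, numPartitions) method of a Partitioner subclass.

```java
// Hypothetical helper holding the arithmetic a custom Partitioner could delegate to.
final class HourPartition {
    private static final long MILLIS_PER_HOUR = 3_600_000L;

    // Whole hours since the Unix epoch; floorDiv keeps negative inputs consistent.
    static long hourBucket(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_HOUR);
    }

    // Maps every record of the same hour to the same partition index in [0, numPartitions).
    static int partitionFor(long epochMillis, int numPartitions) {
        return (int) Math.floorMod(hourBucket(epochMillis), (long) numPartitions);
    }
}
```

If the input covers X consecutive hours and the job runs with X reducers, each hour lands on its own reducer, because X consecutive integers are distinct modulo X. If the hours have gaps, two hours can share a reducer, and an explicit hour-to-index lookup would be needed instead.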

Re: Map-Reduce: How to make MR output one file an hour?

Posted by Simon Dong <si...@gmail.com>.

You can use MultipleOutputs and construct a custom file name based on
the timestamp.

http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html


On Fri, Feb 28, 2014 at 11:44 PM, Fengyun RAO <ra...@gmail.com> wrote:

> It's a common web log analysis situation. The original weblog is saved
> every hour on multiple servers.
> Now we would like the parsed log results to be saved one file an hour. How
> to make it?
>
> In our MR job, the input is a directory with many files in many hours,
> let's say 4X files in X hours.
> if there are e.g. 10 Reducers, then all of the results would be
> partitioned into 10 files, each of which contains results in every hour.
> We would like the results to be save in X files, each of which contains
> only one-hour result.
> Since the input files could change, I can't even set the reducer number to
> be exactly X in the program.
>
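
A minimal sketch of how the timestamp-based name could be built, assuming timestamps are epoch milliseconds and hours are bucketed in UTC. The class HourlyOutputName and the yyyy-MM-dd-HH pattern are hypothetical choices; the resulting string is the kind of value that could be passed as the baseOutputPath argument of MultipleOutputs.write(key, value, baseOutputPath).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: turns a record timestamp into an hour-based output name.
final class HourlyOutputName {
    // Assumed naming pattern, evaluated in UTC so the result does not depend on the host zone.
    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    // All timestamps within the same UTC hour yield the same base name.
    static String baseOutputPath(long epochMillis) {
        return HOUR_FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

MultipleOutputs appends a task suffix such as -r-00000 to the base name, so records for one hour end up in a single file only if they all reach the same reducer, for example by using the hour as the reduce key.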
