Posted to user@hadoop.apache.org by Matthieu Labour <ma...@actionx.com> on 2012/09/25 20:08:06 UTC

Help on a Simple program

Hi

I am completely new to Hadoop and I am trying to address the following
simple application. I apologize if this sounds trivial.

I have multiple log files. I need to read the log files, collect the
entries that meet some conditions, and write them back to files for further
processing. (In other words, I need to filter out some events.)

I am using the WordCount example to get going.

public static class Map extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if(-1 != meetConditions(value)) {
                context.write(value, one);
            }
        }
    }

public static class Reduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            context.write(key, new IntWritable(1));
        }
    }

The problem is that it prints the value 1 after each entry.

Hence my question: what is the best trivial implementation of the map and
reduce functions to address the use case above?

Thank you greatly for your help

Re: Help on a Simple program

Posted by Bertrand Dechoux <de...@gmail.com>.
First, the default reducer implementation is the identity, so you could
reuse it directly.
Second, to make things clearer, you could use NullWritable instead of
IntWritable.
Third, with regards to the output, you may need to write a custom output
format (and I don't see another way except using pig, cascading...
http://www.cascading.org/2012/07/02/cascading-for-the-impatient-part-1/).
Fourth, in Java you have a boolean type, so you might want
your meetConditions function to return a boolean instead of an integer.
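
A rough sketch of a mapper along those lines (not from this thread; imports are
omitted as in the snippets above, and meetConditions with its "ERROR" test is a
placeholder for the real filtering logic):

public static class FilterMapper extends
            Mapper<LongWritable, Text, Text, NullWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Write the matching log line as the key; NullWritable carries no payload.
            if (meetConditions(value)) {
                context.write(value, NullWritable.get());
            }
        }

        // Placeholder predicate: return true for the lines to keep.
        private boolean meetConditions(Text value) {
            return value.toString().contains("ERROR");
        }
    }

With NullWritable values, the stock TextOutputFormat writes only the key, i.e.
the log line itself, so a custom output format may not even be necessary for
plain filtering; the default identity Reducer can be kept, or the reduce phase
can be skipped entirely.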

Regards

Bertrand

On Tue, Sep 25, 2012 at 8:08 PM, Matthieu Labour <ma...@actionx.com> wrote:

> Hi
>
> I am completely new to Hadoop and I am trying to address the following
> simple application. I apologize if this sounds trivial.
>
> I have multiple log files I need to read the log files and collect the
> entries that meet some conditions and write them back to files for further
> processing. ( On other words, I need to filter out some events)
>
> I am using the WordCount example to get going.
>
> public static class Map extends
>             Mapper<LongWritable, Text, Text, IntWritable> {
>         private final static IntWritable one = new IntWritable(1);
>
>         public void map(LongWritable key, Text value, Context context)
>                 throws IOException, InterruptedException {
>             if(-1 != meetConditions(value)) {
>                 context.write(value, one);
>             }
>         }
>     }
>
> public static class Reduce extends
>             Reducer<Text, IntWritable, Text, IntWritable> {
>
>         public void reduce(Text key, Iterable<IntWritable> values,
>                 Context context) throws IOException, InterruptedException {
>             context.write(key, new IntWritable(1));
>         }
>     }
>
> The problem is that it prints the value 1 after each entry.
>
> Hence my question. What is the best trivial implementation of the map and
> reduce function to address the use case above ?
>
> Thank you greatly for your help
>



-- 
Bertrand Dechoux

Re: Help on a Simple program

Posted by Bejoy Ks <be...@gmail.com>.
Hi

If you don't want either the key or the value in the output, just make the
corresponding data type NullWritable.

Since you just need to filter out a few records/items from your logs, the
reduce phase is not mandatory; a mapper alone would suffice for your needs.
From your mapper, just output the records that match your criteria. Also set
the number of reduce tasks to zero in your driver class to completely skip the
reduce phase.

Sample code would look like this:

public static class Map extends
            Mapper<LongWritable, Text, Text, NullWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit only the matching lines; NullWritable means there is no value part.
            if (-1 != meetConditions(value)) {
                context.write(value, NullWritable.get());
            }
        }
    }


In your driver class:
job.setNumReduceTasks(0);

Alternatively, you can specify this at runtime as
hadoop jar xyz.jar com.*.*.* -D mapred.reduce.tasks=0 input/ output/
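
A sketch of such a map-only driver (assuming the Map class above is nested in
it; the class name and argument handling are illustrative, not from this
thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogFilterDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "log filter");
        job.setJarByClass(LogFilterDriver.class);

        job.setMapperClass(Map.class);      // the filtering mapper shown above
        job.setNumReduceTasks(0);           // map-only job: no shuffle, no reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory of log files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With zero reduce tasks, the mapper output is written straight to the output
directory, one part-m-* file per map task.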

On Tue, Sep 25, 2012 at 11:38 PM, Matthieu Labour <ma...@actionx.com> wrote:

> Hi
>
> I am completely new to Hadoop and I am trying to address the following
> simple application. I apologize if this sounds trivial.
>
> I have multiple log files I need to read the log files and collect the
> entries that meet some conditions and write them back to files for further
> processing. ( On other words, I need to filter out some events)
>
> I am using the WordCount example to get going.
>
> public static class Map extends
>             Mapper<LongWritable, Text, Text, IntWritable> {
>         private final static IntWritable one = new IntWritable(1);
>
>         public void map(LongWritable key, Text value, Context context)
>                 throws IOException, InterruptedException {
>             if(-1 != meetConditions(value)) {
>                 context.write(value, one);
>             }
>         }
>     }
>
> public static class Reduce extends
>             Reducer<Text, IntWritable, Text, IntWritable> {
>
>         public void reduce(Text key, Iterable<IntWritable> values,
>                 Context context) throws IOException, InterruptedException {
>             context.write(key, new IntWritable(1));
>         }
>     }
>
> The problem is that it prints the value 1 after each entry.
>
> Hence my question. What is the best trivial implementation of the map and
> reduce function to address the use case above ?
>
> Thank you greatly for your help
>
