Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/02/04 23:04:08 UTC

hadoop: how to find top N frequently occurring words

Hi,

Can someone guide me on how to write a program using the hadoop framework
that analyzes the log files and finds the most frequently
occurring keywords? The log file has the format -

keyword source dateId

Thanks,
Tarandeep

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.
Oops.  Glad I said approximately.

If you want this to be exactly correct, you need to set it to have a single
reducer and also to use an adder as a combiner.

If the compression due to counting is insufficient to make the single
reducer much faster than the map steps, then putting a top-N filter in the
combiner will help.  Make sure that you have a much deeper limit in the
combiner than in the reducer to make sure that you don't accidentally lose
any counts.  If you are looking at millions of input records, then having a
thousand records come out of each mapper/combiner pair is no big deal.
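
With the old (0.x) JobConf API used elsewhere in this thread, that setup
might look roughly like the sketch below. The mapper/combiner/reducer class
names are placeholders, and the input/output path calls are omitted because
they vary across 0.x releases:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TopKeywordsJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(TopKeywordsJob.class);
            conf.setJobName("top-keywords");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(KeywordMapper.class);   // placeholder: emits (keyword, 1)
            conf.setCombinerClass(SumCombiner.class);   // placeholder: adds counts, or keeps a deep top-N (say 1000)
            conf.setReducerClass(TopNReducer.class);    // placeholder: keeps the exact global top 10

            // a single reducer sees every keyword, so the top N it keeps is exact
            conf.setNumReduceTasks(1);

            // input/output path setup omitted; the exact calls vary across 0.x releases
            JobClient.runJob(conf);
        }
    }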

On 2/5/08 11:30 AM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> On Feb 4, 2008 3:30 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>> Approximately this:
>> 
>>    TopNReducer extends MapReduceBase
>>           implements Reducer<Text, IntWritable, Text, IntWritable> {
>>       OrderedSet<KeyWordIntegerPair> top =
>>             new TreeSet<KeyWordIntegerPair>();
>>       FileSystem fs;
>> 
>>       void configure(JobConf conf) {
>>          fs = FileSystem.get(conf);
>>       }
>> 
>>       void reduce(Text keyword, IntWritable counts,
>>             OutputCollector<Text, IntWritable> out, Reporter reporter) {
>>          int sum = 0;
>>          while (counts.hasNext) {
>>             sum += counts.next();
>>          }
>> 
>>          if (top.size() < 10 || sum > top.first().getCount()) {
>>               top.add(new KeyWordIntegerPair(keyword, sum);
>>          }
>> 
>>          while (top.size() > 10) {
>>              top.remove(0);
>>          }
>>      }
>> 
>>      void close() {
>>          PrintWriter out = new PW(fs.create(new Path("top-counts")));
>>          for (v : top) {
>>              out.printf("%s\t%d\n", v.keyword(), v.count());
>>          }
>>      }
>>    }
>> 
> 
> Correct me if I am wrong ...
> Here Reducer class is using OrderedSet data structure. The JobTracker
> or the master schedules reduce jobs on slave nodes, so this means
> slave nodes are having their own data structure. If this is correct
> then this orderedSet is also local to slave node, then I don't think
> it is going to work.
> 
> or this data structure is global ? In that case, do I have to take
> care of synchronization ?
> 
> thanks,
> Taran
> 
>> You will have to fix the errors I made in typing this off the cuff, of
>> course.
>> 
>> 
>> 
>> 
>> On 2/4/08 3:19 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>> 
>>> On Feb 4, 2008 3:07 PM, Ted Dunning <td...@veoh.com> wrote:
>>>> 
>>>> Yes, you can do it in one pass, but the reducer will have to accumulate the
>>>> top N items.  If N is small, this is no problem.  If N is large, that is a
>>>> problem.  You also have the problem that the reducer has a close method
>>>> where it can output the accumulated data, but it can't go into the normal
>>>> output channel because you don't have access to the output collector at
>>>> that
>>>> point.
>>>> 
>>> 
>>> Could you elaborate this a bit.
>>> My log file looks like this -
>>> 
>>> keyword  source  dateID
>>> 
>>> right now my mapper output the following as key- value pair
>>> keyword_source_dateID - 1
>>> 
>>> reducer counts the 1s.. and output
>>> keyword_source_dateId  frequency
>>> 
>>> so it is just the word count program so far. I have another program
>>> that identifies the top N keywords. Please tell me how can I modify
>>> the reducer to accumulate top N items. N is small e.g 10
>>> 
>>> thanks,
>>> Taran
>>> 
>>>> In practical situations, your count reducer can eliminate items with very
>>>> small counts that you know cannot be in the top N.  This makes the total
>>>> output much smaller than the input.  This means that making a second pass
>>>> over the data costs very little.  Even without the threshold, the second
>>>> pass will likely be so much faster than the first that it doesn't matter.
>>>> 
>>>> IF you are counting things that satisfy Zipf's law, then the counts will be
>>>> proportional to 1/r where r is the rank of the item.  Using this, you can
>>>> show that the average count for your keywords will be
>>>> 
>>>>      E(k) = N H_m,2 / (H_m)^2
>>>> 
>>>> Where N is the total number of words counted, m is the possible vocabulary,
>>>> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
>>>> second order harmonic number (approximately 1.6).
>>>> 
>>>> This means that you should have a compression of approximately
>>>> 
>>>>      1.6 / log(m)^2
>>>> 
>>>> 
>>>> 
>>>> On 2/4/08 2:20 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>>>> 
>>>>> On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>>>> This is exactly the same as word counting, except that you have a second
>>>>>> pass to find the top n per block of data (this can be done in a mapper)
>>>>>> and
>>>>>> then a reducer can quite easily merge the results together.
>>>>>> 
>>>>> 
>>>>> This would mean I have to write a second program that reads the output
>>>>> of first and does the job. I was wondering if it could be done in one
>>>>> program.
>>>> 
>>>> 
>> 
>> 


Re: hadoop: how to find top N frequently occurring words

Posted by Tarandeep Singh <ta...@gmail.com>.
On Feb 4, 2008 3:30 PM, Ted Dunning <td...@veoh.com> wrote:
>
> Approximately this:
>
>    TopNReducer extends MapReduceBase
>           implements Reducer<Text, IntWritable, Text, IntWritable> {
>       OrderedSet<KeyWordIntegerPair> top =
>             new TreeSet<KeyWordIntegerPair>();
>       FileSystem fs;
>
>       void configure(JobConf conf) {
>          fs = FileSystem.get(conf);
>       }
>
>       void reduce(Text keyword, IntWritable counts,
>             OutputCollector<Text, IntWritable> out, Reporter reporter) {
>          int sum = 0;
>          while (counts.hasNext) {
>             sum += counts.next();
>          }
>
>          if (top.size() < 10 || sum > top.first().getCount()) {
>               top.add(new KeyWordIntegerPair(keyword, sum);
>          }
>
>          while (top.size() > 10) {
>              top.remove(0);
>          }
>      }
>
>      void close() {
>          PrintWriter out = new PW(fs.create(new Path("top-counts")));
>          for (v : top) {
>              out.printf("%s\t%d\n", v.keyword(), v.count());
>          }
>      }
>    }
>

Correct me if I am wrong ...
Here the Reducer class is using an OrderedSet data structure. The JobTracker
(the master) schedules reduce tasks on slave nodes, so each
slave node has its own instance of this data structure. If this is correct,
then the orderedSet is local to each slave node, and I don't think
it is going to work.

Or is this data structure global? In that case, do I have to take
care of synchronization?

thanks,
Taran

> You will have to fix the errors I made in typing this off the cuff, of
> course.
>
>
>
>
> On 2/4/08 3:19 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>
> > On Feb 4, 2008 3:07 PM, Ted Dunning <td...@veoh.com> wrote:
> >>
> >> Yes, you can do it in one pass, but the reducer will have to accumulate the
> >> top N items.  If N is small, this is no problem.  If N is large, that is a
> >> problem.  You also have the problem that the reducer has a close method
> >> where it can output the accumulated data, but it can't go into the normal
> >> output channel because you don't have access to the output collector at that
> >> point.
> >>
> >
> > Could you elaborate this a bit.
> > My log file looks like this -
> >
> > keyword  source  dateID
> >
> > right now my mapper output the following as key- value pair
> > keyword_source_dateID - 1
> >
> > reducer counts the 1s.. and output
> > keyword_source_dateId  frequency
> >
> > so it is just the word count program so far. I have another program
> > that identifies the top N keywords. Please tell me how can I modify
> > the reducer to accumulate top N items. N is small e.g 10
> >
> > thanks,
> > Taran
> >
> >> In practical situations, your count reducer can eliminate items with very
> >> small counts that you know cannot be in the top N.  This makes the total
> >> output much smaller than the input.  This means that making a second pass
> >> over the data costs very little.  Even without the threshold, the second
> >> pass will likely be so much faster than the first that it doesn't matter.
> >>
> >> IF you are counting things that satisfy Zipf's law, then the counts will be
> >> proportional to 1/r where r is the rank of the item.  Using this, you can
> >> show that the average count for your keywords will be
> >>
> >>      E(k) = N H_m,2 / (H_m)^2
> >>
> >> Where N is the total number of words counted, m is the possible vocabulary,
> >> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
> >> second order harmonic number (approximately 1.6).
> >>
> >> This means that you should have a compression of approximately
> >>
> >>      1.6 / log(m)^2
> >>
> >>
> >>
> >> On 2/4/08 2:20 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
> >>
> >>> On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> >>>> This is exactly the same as word counting, except that you have a second
> >>>> pass to find the top n per block of data (this can be done in a mapper) and
> >>>> then a reducer can quite easily merge the results together.
> >>>>
> >>>
> >>> This would mean I have to write a second program that reads the output
> >>> of first and does the job. I was wondering if it could be done in one
> >>> program.
> >>
> >>
>
>

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.

Approximately this:

   // assumes the usual java.util and org.apache.hadoop.fs/io/mapred imports;
   // KeyWordIntegerPair is a small helper class, sketched below
   public class TopNReducer extends MapReduceBase
           implements Reducer<Text, IntWritable, Text, IntWritable> {
      // ordered by ascending count, so first() is the weakest of the kept entries
      private final SortedSet<KeyWordIntegerPair> top =
            new TreeSet<KeyWordIntegerPair>();
      private FileSystem fs;

      public void configure(JobConf conf) {
         try {
            fs = FileSystem.get(conf);
         } catch (IOException e) {
            throw new RuntimeException(e);
         }
      }

      public void reduce(Text keyword, Iterator<IntWritable> counts,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
         int sum = 0;
         while (counts.hasNext()) {
            sum += counts.next().get();
         }

         if (top.size() < 10 || sum > top.first().getCount()) {
            // copy the Text: Hadoop reuses the key instance between calls
            top.add(new KeyWordIntegerPair(new Text(keyword), sum));
         }

         // trim back to the 10 largest
         while (top.size() > 10) {
            top.remove(top.first());
         }
      }

      public void close() throws IOException {
         PrintWriter out = new PrintWriter(fs.create(new Path("top-counts")));
         for (KeyWordIntegerPair v : top) {
            out.printf("%s\t%d%n", v.keyword(), v.count());
         }
         out.close();
      }
   }

You will have to fix the errors I made in typing this off the cuff, of
course.
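
The KeyWordIntegerPair helper is not part of Hadoop and is never defined in
this thread; a minimal sketch, assuming entries should be ordered by
ascending count with the keyword as a tie-breaker so that equal counts do
not collapse in the TreeSet:

    import org.apache.hadoop.io.Text;

    // hypothetical helper: pairs a keyword with its count and orders pairs
    // by count (ascending), breaking ties on the keyword itself
    public class KeyWordIntegerPair implements Comparable<KeyWordIntegerPair> {
        private final Text keyword;
        private final int count;

        public KeyWordIntegerPair(Text keyword, int count) {
            this.keyword = keyword;
            this.count = count;
        }

        public Text keyword() { return keyword; }
        public int count() { return count; }
        public int getCount() { return count; }

        public int compareTo(KeyWordIntegerPair other) {
            if (count != other.count) {
                return count < other.count ? -1 : 1;
            }
            return keyword.compareTo(other.keyword);
        }
    }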



On 2/4/08 3:19 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> On Feb 4, 2008 3:07 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>> Yes, you can do it in one pass, but the reducer will have to accumulate the
>> top N items.  If N is small, this is no problem.  If N is large, that is a
>> problem.  You also have the problem that the reducer has a close method
>> where it can output the accumulated data, but it can't go into the normal
>> output channel because you don't have access to the output collector at that
>> point.
>> 
> 
> Could you elaborate this a bit.
> My log file looks like this -
> 
> keyword  source  dateID
> 
> right now my mapper output the following as key- value pair
> keyword_source_dateID - 1
> 
> reducer counts the 1s.. and output
> keyword_source_dateId  frequency
> 
> so it is just the word count program so far. I have another program
> that identifies the top N keywords. Please tell me how can I modify
> the reducer to accumulate top N items. N is small e.g 10
> 
> thanks,
> Taran
> 
>> In practical situations, your count reducer can eliminate items with very
>> small counts that you know cannot be in the top N.  This makes the total
>> output much smaller than the input.  This means that making a second pass
>> over the data costs very little.  Even without the threshold, the second
>> pass will likely be so much faster than the first that it doesn't matter.
>> 
>> IF you are counting things that satisfy Zipf's law, then the counts will be
>> proportional to 1/r where r is the rank of the item.  Using this, you can
>> show that the average count for your keywords will be
>> 
>>      E(k) = N H_m,2 / (H_m)^2
>> 
>> Where N is the total number of words counted, m is the possible vocabulary,
>> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
>> second order harmonic number (approximately 1.6).
>> 
>> This means that you should have a compression of approximately
>> 
>>      1.6 / log(m)^2
>> 
>> 
>> 
>> On 2/4/08 2:20 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>> 
>>> On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>> This is exactly the same as word counting, except that you have a second
>>>> pass to find the top n per block of data (this can be done in a mapper) and
>>>> then a reducer can quite easily merge the results together.
>>>> 
>>> 
>>> This would mean I have to write a second program that reads the output
>>> of first and does the job. I was wondering if it could be done in one
>>> program.
>> 
>> 


Re: hadoop: how to find top N frequently occurring words

Posted by Tarandeep Singh <ta...@gmail.com>.
On Feb 4, 2008 3:07 PM, Ted Dunning <td...@veoh.com> wrote:
>
> Yes, you can do it in one pass, but the reducer will have to accumulate the
> top N items.  If N is small, this is no problem.  If N is large, that is a
> problem.  You also have the problem that the reducer has a close method
> where it can output the accumulated data, but it can't go into the normal
> output channel because you don't have access to the output collector at that
> point.
>

Could you elaborate on this a bit?
My log file looks like this -

keyword  source  dateID

Right now my mapper outputs the following key-value pair:
keyword_source_dateID - 1

The reducer counts the 1s and outputs:
keyword_source_dateId  frequency

So it is just the word count program so far. I have another program
that identifies the top N keywords. Please tell me how I can modify
the reducer to accumulate the top N items. N is small, e.g. 10.
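
For reference, the mapper described above might look roughly like this with
the old mapred API (a sketch; it assumes whitespace-separated fields as in
the log format above, and the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class KeywordMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text compositeKey = new Text();

        public void map(LongWritable offset, Text line,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            // "keyword source dateId" -> key "keyword_source_dateId", value 1
            String[] fields = line.toString().split("\\s+");
            if (fields.length >= 3) {
                compositeKey.set(fields[0] + "_" + fields[1] + "_" + fields[2]);
                out.collect(compositeKey, ONE);
            }
        }
    }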

thanks,
Taran

> In practical situations, your count reducer can eliminate items with very
> small counts that you know cannot be in the top N.  This makes the total
> output much smaller than the input.  This means that making a second pass
> over the data costs very little.  Even without the threshold, the second
> pass will likely be so much faster than the first that it doesn't matter.
>
> IF you are counting things that satisfy Zipf's law, then the counts will be
> proportional to 1/r where r is the rank of the item.  Using this, you can
> show that the average count for your keywords will be
>
>      E(k) = N H_m,2 / (H_m)^2
>
> Where N is the total number of words counted, m is the possible vocabulary,
> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
> second order harmonic number (approximately 1.6).
>
> This means that you should have a compression of approximately
>
>      1.6 / log(m)^2
>
>
>
> On 2/4/08 2:20 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>
> > On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> >> This is exactly the same as word counting, except that you have a second
> >> pass to find the top n per block of data (this can be done in a mapper) and
> >> then a reducer can quite easily merge the results together.
> >>
> >
> > This would mean I have to write a second program that reads the output
> > of first and does the job. I was wondering if it could be done in one
> > program.
>
>

Re: hadoop: how to find top N frequently occurring words

Posted by Tarandeep Singh <ta...@gmail.com>.
On Feb 4, 2008 2:28 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> If by "one program" you mean a single map/reduce, then if you don't have
> much data, you could easily have the mapper store all the input and then
> compute the top-n.

Yes, I meant a single map/reduce. I do have huge data, so I think I will
go with the second option that you suggested. But still, I would like
to know how to store data in the mapper. Could you tell me the API for
that?
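
There is no special API for this: a Mapper is an ordinary Java object, so
per-task state can live in instance fields and be flushed in close(). A
sketch of a second-pass mapper that keeps a local top N this way (the class
name and the tab-separated "keyword<TAB>count" input are assumptions; note
that keying the TreeMap on the count collapses keywords with equal counts,
so a real version would need a multiset):

    import java.io.IOException;
    import java.util.TreeMap;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LocalTopNMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int N = 10;
        // count -> keyword; ties collapse, which a real version must handle
        private final TreeMap<Integer, String> top = new TreeMap<Integer, String>();
        private OutputCollector<Text, IntWritable> out;

        public void map(LongWritable offset, Text line,
                OutputCollector<Text, IntWritable> collector, Reporter reporter)
                throws IOException {
            this.out = collector;              // keep a handle for close()
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                top.put(Integer.parseInt(parts[1]), parts[0]);
                if (top.size() > N) {
                    top.remove(top.firstKey()); // drop the smallest
                }
            }
        }

        public void close() throws IOException {
            if (out != null) {
                for (Integer count : top.keySet()) {
                    out.collect(new Text(top.get(count)), new IntWritable(count));
                }
            }
        }
    }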

>
> If however you have a lot of data, then the more interesting alternative is
> to use a randomised data-structure (for example a Bloomier Filter) and count
> directly in that.  This would lead to some quantifiable error rate, which
> may be acceptable for your application.
>
Thanks for suggesting this. I didn't know about it. I will read more
about it and hopefully it will solve my problem.

thanks,
Taran

> Miles
>
>
> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> >
> > On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> > > This is exactly the same as word counting, except that you have a second
> > > pass to find the top n per block of data (this can be done in a mapper)
> > and
> > > then a reducer can quite easily merge the results together.
> > >
> >
> > This would mean I have to write a second program that reads the output
> > of first and does the job. I was wondering if it could be done in one
> > program.
> >
> > > This wouldn't be homework, would it?
> > >
> > no, it isn't homework. I read the word count program that came along
> > with hadoop, wanted to extend it to solve my problem.
> >
> > thanks,
> > Taran
> >
> > > MIles
> > >
> > >
> > > On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Can someone guide me on how to write program using hadoop framework
> > > > that analyze the log files and find out the top most frequently
> > > > occurring keywords. The log file has the format -
> > > >
> > > > keyword source dateId
> > > >
> > > > Thanks,
> > > > Tarandeep
> > > >
> > >
> > >
> > >
> > > --
> > > The University of Edinburgh is a charitable body, registered in
> > Scotland,
> > > with registration number SC005336.
> > >
> >
>
>
>
> --
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>

Re: hadoop: how to find top N frequently occurring words

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
If by "one program" you mean a single map/reduce, then if you don't have
much data, you could easily have the mapper store all the input and then
compute the top-n.

If however you have a lot of data, then the more interesting alternative is
to use a randomised data-structure (for example a Bloomier Filter) and count
directly in that.  This would lead to some quantifiable error rate, which
may be acceptable for your application.

Miles
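
A Bloomier filter itself takes some machinery to build; as a simpler
illustration of the same idea (counting in a fixed amount of memory with a
quantifiable error), here is a sketch of a count-min sketch, a related
randomised counter rather than the structure named above. Items are only
ever over-counted, and the over-count is bounded (roughly total count /
width per item) with probability that improves with the number of rows:

    import java.util.Random;

    // Count-min sketch: a fixed-size 2-D array of counters. Each item is
    // hashed into one counter per row; the estimate is the minimum over its
    // counters, which can only overestimate the true count.
    public class CountMinSketch {
        private final long[][] counts;
        private final int[] hashSeeds;
        private final int width;

        public CountMinSketch(int depth, int width) {
            this.width = width;
            this.counts = new long[depth][width];
            this.hashSeeds = new int[depth];
            Random rand = new Random(42);
            for (int i = 0; i < depth; i++) {
                hashSeeds[i] = rand.nextInt();
            }
        }

        private int bucket(String item, int row) {
            int h = item.hashCode() * 31 + hashSeeds[row];
            int b = h % width;
            return b < 0 ? b + width : b;
        }

        public void add(String item, long count) {
            for (int row = 0; row < counts.length; row++) {
                counts[row][bucket(item, row)] += count;
            }
        }

        public long estimate(String item) {
            long min = Long.MAX_VALUE;
            for (int row = 0; row < counts.length; row++) {
                min = Math.min(min, counts[row][bucket(item, row)]);
            }
            return min;
        }
    }

Each mapper (or combiner) could add() every keyword it sees and query
estimate() at the end to pick its local top-N candidates.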

On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
>
> On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> > This is exactly the same as word counting, except that you have a second
> > pass to find the top n per block of data (this can be done in a mapper)
> and
> > then a reducer can quite easily merge the results together.
> >
>
> This would mean I have to write a second program that reads the output
> of first and does the job. I was wondering if it could be done in one
> program.
>
> > This wouldn't be homework, would it?
> >
> no, it isn't homework. I read the word count program that came along
> with hadoop, wanted to extend it to solve my problem.
>
> thanks,
> Taran
>
> > MIles
> >
> >
> > On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Can someone guide me on how to write program using hadoop framework
> > > that analyze the log files and find out the top most frequently
> > > occurring keywords. The log file has the format -
> > >
> > > keyword source dateId
> > >
> > > Thanks,
> > > Tarandeep
> > >
> >
> >
> >
> > --
> > The University of Edinburgh is a charitable body, registered in
> Scotland,
> > with registration number SC005336.
> >
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.
Yes, you can do it in one pass, but the reducer will have to accumulate the
top N items.  If N is small, this is no problem.  If N is large, that is a
problem.  You also have the problem that the reducer has a close method
where it can output the accumulated data, but it can't go into the normal
output channel because you don't have access to the output collector at that
point.

In practical situations, your count reducer can eliminate items with very
small counts that you know cannot be in the top N.  This makes the total
output much smaller than the input.  This means that making a second pass
over the data costs very little.  Even without the threshold, the second
pass will likely be so much faster than the first that it doesn't matter.
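
In the two-pass setup, that threshold is a one-line change to the counting
reducer. A sketch (MIN_COUNT stands for whatever cutoff the true top N are
known to exceed; the class name is illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ThresholdCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int MIN_COUNT = 100;  // assumed cutoff the true top N must exceed

        public void reduce(Text keyword, Iterator<IntWritable> counts,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            // drop the long tail: anything below the cutoff cannot be in the
            // top N, so the second pass only reads the surviving counts
            if (sum >= MIN_COUNT) {
                out.collect(keyword, new IntWritable(sum));
            }
        }
    }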

IF you are counting things that satisfy Zipf's law, then the counts will be
proportional to 1/r where r is the rank of the item.  Using this, you can
show that the average count for your keywords will be

     E(k) = N H_m,2 / (H_m)^2

Where N is the total number of words counted, m is the possible vocabulary,
H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
second order harmonic number (approximately 1.6).

This means that you should have a compression of approximately

     1.6 / log(m)^2
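
As a rough numeric check of these formulas, taking the logarithm as natural
and a vocabulary of m = 1,000,000 keywords:

     H_m ~= ln(10^6) ~= 13.8        H_m,2 ~= pi^2/6 ~= 1.6

     H_m,2 / (H_m)^2 ~= 1.6 / 190 ~= 1/120

Read as the ratio of counted output to raw input, that says the first pass
should shrink the data by roughly two orders of magnitude under this model.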


On 2/4/08 2:20 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>> This is exactly the same as word counting, except that you have a second
>> pass to find the top n per block of data (this can be done in a mapper) and
>> then a reducer can quite easily merge the results together.
>> 
> 
> This would mean I have to write a second program that reads the output
> of first and does the job. I was wondering if it could be done in one
> program.


Re: hadoop: how to find top N frequently occurring words

Posted by Tarandeep Singh <ta...@gmail.com>.
On Feb 4, 2008 2:11 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> This is exactly the same as word counting, except that you have a second
> pass to find the top n per block of data (this can be done in a mapper) and
> then a reducer can quite easily merge the results together.
>

This would mean I have to write a second program that reads the output
of the first and does the job. I was wondering if it could be done in one
program.

> This wouldn't be homework, would it?
>
No, it isn't homework. I read the word count program that came along
with hadoop and wanted to extend it to solve my problem.

thanks,
Taran

> MIles
>
>
> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> >
> > Hi,
> >
> > Can someone guide me on how to write program using hadoop framework
> > that analyze the log files and find out the top most frequently
> > occurring keywords. The log file has the format -
> >
> > keyword source dateId
> >
> > Thanks,
> > Tarandeep
> >
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>

Re: hadoop: how to find top N frequently occurring words

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
This is exactly the same as word counting, except that you have a second
pass to find the top n per block of data (this can be done in a mapper) and
then a reducer can quite easily merge the results together.

This wouldn't be homework, would it?

MIles

On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
>
> Hi,
>
> Can someone guide me on how to write program using hadoop framework
> that analyze the log files and find out the top most frequently
> occurring keywords. The log file has the format -
>
> keyword source dateId
>
> Thanks,
> Tarandeep
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.

Groovy itself seems very stable.

My codebase seems to work well enough to be considered moderately stable in
the sense that it doesn't fall over too often. It is very new code, though.

My codebase is absolutely not stable in the sense of low rate of change.  I
have a long list of known improvements that I need to work on.


On 2/4/08 2:50 PM, "Miles Osborne" <mi...@inf.ed.ac.uk> wrote:

> sorry, I meant Groovy
> 
> Miles
> 
> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
>> 
>> On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>> How stable is the code?  I could quite easily set some undergraduate
>> project
>>> to do something with it, for example process query logs


Re: Groovy integration details

Posted by Khalil Honsali <k....@gmail.com>.
I am not in a position to decide this, but it'll be clearer if the code is
available (as contrib first ?)...

K. Honsali

On 05/02/2008, Ted Dunning <td...@veoh.com> wrote:
>
>
> The system as it stands supports the following major features
>
> - map-reduce programs can be constructed for interactive, local use or
> hadoop based execution
>
> - map-reduce programs are functions that can be nested and composed
>
> - inputs to map-reduce programs can be strings, lists of strings, local
> files or HDFS files
>
> - outputs are stored in HDFS
>
> - outputs can be consumed by multiple other functions
>
> The current minor(ish) limitations include
>
> - combiners, partition functions and sorting aren't supported yet
>
> - you can't pass conventional java Mappers or Reducers to the framework
>
> - only one input file can be given
>
> - the system doesn't clean up afterwards
>
> These are all easily addressed and should be fixed over the next week or
> two.
>
> The major limitations include:
>
> - only one script can be specified
>
> - additional jars cannot be submitted
>
> - no explicit group/co-group syntactic sugar is provided
>
> These will take a bit longer to resolve.  I hope to incorporate jar
> building
> code similar to that used by the streaming system to address most of this.
> The group/co-group stuff is just the matter of a bit of work.
>
> Pig is very different from this Groovy integration.  They are trying to
> build a new relational algebra language.  I am just trying to write
> map-reduce programs.  They explicitly do not want to support general
> coding
> of functions except in a very limited way or via integration of Java code
> while that is my primary goal.  The other big difference is that my system
> is simple enough that I was able to implement it with a week of coding
> (after a few weeks of noodling about how to make it possible at all).
>
>
>
> On 2/4/08 3:28 PM, "Khalil Honsali" <k....@gmail.com> wrote:
>
> > sorry for the unclarity,
> >
> > - I think I understand that Groovy is already usable and stable, but
> > requires some testing ? what others things required?
> > - what is the next step, i.e., roadmap if any, what evolution / growth
> > direction?
> > - I haven't tried Pig but it also seems to support submitting a function
> to
> > be transformed to map/reduce, though pig is higher level?
> >
> > PS:
> >  -  maybe Groovy requires another mailinglist thread ...
> >
> > K. Honsali
> >
> >
> >
> > On 05/02/2008, Ted Dunning <td...@veoh.com> wrote:
> >>
> >>
> >> Did you mean who, what, when, where and how?
> >>
> >> Who is me.  I am the only author so far.
> >>
> >> What is a groovy/java program that supports running groovy/hadoop
> scripts
> >>
> >> When is nearly now.
> >>
> >> Where is everywhere (this is the internet)
> >>
> >> How is an open question.  I think that Doug's suggested evolution of
> Jira
> >> with patches -> contrib -> sub-project is appropriate.
> >>
> >>
> >> On 2/4/08 2:59 PM, "Khalil Honsali" <k....@gmail.com> wrote:
> >>
> >>> Hi all, Mr. Dunning;
> >>>
> >>> I am interested in the Groovy idea, especially for processing text, I
> >> think
> >>> it can be a good opensource alternative to Google's Sawzall.
> >>>
> >>> Please let me know the 5-Ws of the matter if possible.
> >>>
> >>> K. Honsali
> >>>
> >>> On 05/02/2008, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> >>>>
> >>>> sorry, I meant Groovy
> >>>>
> >>>> Miles
> >>>>
> >>>> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> >>>>>
> >>>>> On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> >>>>>> How stable is the code?  I could quite easily set some
> undergraduate
> >>>>> project
> >>>>>> to do something with it, for example process query logs
> >>>>>>
> >>>>>
> >>>>> I started learning and using hadoop few days back. The program that
> I
> >>>>> have is similar to word count except that it processes a querylog in
> >>>>> special format. I have another program that reads the output of this
> >>>>> program and computes the top N keywords. Want to make it a one
> program
> >>>>> (single map reduce)
> >>>>>
> >>>>> -Taran
> >>>>>
> >>>>>> Miles
> >>>>>>
> >>>>>>
> >>>>>> On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> This is a great opportunity for me to talk about the Groovy
> support
> >>>>> that I
> >>>>>>> have just gotten running.  I am looking for friendly testers as
> this
> >>>>> code
> >>>>>>> is
> >>>>>>> definitely not ready for full release.
> >>>>>>>
> >>>>>>> The program you need in groovy is this:
> >>>>>>>
> >>>>>>> // define the map-reduce function by specifying map and reduce
> >>>>> functions
> >>>>>>> logCount = Hadoop.mr(
> >>>>>>>    {key, value, out, report -> out.collect(value.split[0], 1)},
> >>>>>>>    {keyword, counts, out, report ->
> >>>>>>>       sum = 0;
> >>>>>>>       counts.each { sum += it}
> >>>>>>>       out.collect(keyword, sum)
> >>>>>>>    })
> >>>>>>>
> >>>>>>> // apply the function to an input file and collect the results in
> a
> >>>>> map
> >>>>>>> results = [:]
> >>>>>>> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
> >>>>>>>     line ->
> >>>>>>>       parts = line.split(\t)
> >>>>>>>       results[parts[0]] = parts[1]
> >>>>>>> }
> >>>>>>>
> >>>>>>> // sort the entries in the map by descending count and print the
> >>>>> results
> >>>>>>> for (x in results.entrySet().sort( {-it.value} )) {
> >>>>>>>    println x
> >>>>>>> }
> >>>>>>>
> >>>>>>> // delete the temporary results
> >>>>>>> Hadoop.cleanup(results)
> >>>>>>>
> >>>>>>> The important points here are:
> >>>>>>>
> >>>>>>> 1) the groovy binding lets you express the map-reduce part of your
> >>>>> program
> >>>>>>> simply.
> >>>>>>>
> >>>>>>> 2) collecting the results is trivial ... You don't have to worry
> >>>> about
> >>>>>>> where
> >>>>>>> or how the results are kept.  You would use the same code to read
> a
> >>>>> local
> >>>>>>> file as to read the results of the map-reduce computation
> >>>>>>>
> >>>>>>> 3) because of (2), you can do some computation locally (the sort)
> >>>> and
> >>>>> some
> >>>>>>> in parallel (the counting).  You could easily translate the sort
> to
> >>>> a
> >>>>>>> hadoop
> >>>>>>> call as well.
> >>>>>>>
> >>>>>>> I know that this doesn't quite answer the question because my
> >>>>>>> groovy-hadoop
> >>>>>>> bridge isn't available yet, but it hopefully will spark some
> >>>> interest.
> >>>>>>>
> >>>>>>> The question I would like to pose to the community is this:
> >>>>>>>
> >>>>>>>   What is the best way to proceed with code like this that is not
> >>>>> ready
> >>>>>>> for
> >>>>>>> prime time, but is ready for others to contribute and possibly
> also
> >>>>> use?
> >>>>>>> Should I follow the Jaql and Cascading course and build a separate
> >>>>>>> repository and web site or should I try to add this as a contrib
> >>>>> package
> >>>>>>> like streaming?  Or should I just hand out source by hand for a
> >>>> little
> >>>>>>> while
> >>>>>>> to get feedback?
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Can someone guide me on how to write program using hadoop
> >>>> framework
> >>>>>>>> that analyze the log files and find out the top most frequently
> >>>>>>>> occurring keywords. The log file has the format -
> >>>>>>>>
> >>>>>>>> keyword source dateId
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tarandeep
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> The University of Edinburgh is a charitable body, registered in
> >>>>> Scotland,
> >>>>>> with registration number SC005336.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> The University of Edinburgh is a charitable body, registered in
> >> Scotland,
> >>>> with registration number SC005336.
> >>>>
> >>
> >>
>
>

Groovy integration details

Posted by Ted Dunning <td...@veoh.com>.
The system as it stands supports the following major features

- map-reduce programs can be constructed for interactive, local use or
hadoop based execution

- map-reduce programs are functions that can be nested and composed

- inputs to map-reduce programs can be strings, lists of strings, local
files or HDFS files

- outputs are stored in HDFS

- outputs can be consumed by multiple other functions

The current minor(ish) limitations include

- combiners, partition functions and sorting aren't supported yet

- you can't pass conventional java Mappers or Reducers to the framework

- only one input file can be given

- the system doesn't clean up afterwards

These are all easily addressed and should be fixed over the next week or
two.

The major limitations include:

- only one script can be specified

- additional jars cannot be submitted

- no explicit group/co-group syntactic sugar is provided

These will take a bit longer to resolve.  I hope to incorporate jar building
code similar to that used by the streaming system to address most of this.
The group/co-group stuff is just a matter of a bit of work.

Pig is very different from this Groovy integration.  They are trying to
build a new relational algebra language.  I am just trying to write
map-reduce programs.  They explicitly do not want to support general coding
of functions except in a very limited way or via integration of Java code
while that is my primary goal.  The other big difference is that my system
is simple enough that I was able to implement it with a week of coding
(after a few weeks of noodling about how to make it possible at all).



On 2/4/08 3:28 PM, "Khalil Honsali" <k....@gmail.com> wrote:

> sorry for the unclarity,
> 
> - I think I understand that Groovy is already usable and stable, but
> requires some testing ? what others things required?
> - what is the next step, i.e., roadmap if any, what evolution / growth
> direction?
> - I haven't tried Pig but it also seems to support submitting a function to
> be transformed to map/reduce, though pig is higher level?
> 
> PS:
>  -  maybe Groovy requires another mailinglist thread ...
> 
> K. Honsali
> 
> 
> 
> On 05/02/2008, Ted Dunning <td...@veoh.com> wrote:
>> 
>> 
>> Did you mean who, what, when, where and how?
>> 
>> Who is me.  I am the only author so far.
>> 
>> What is a groovy/java program that supports running groovy/hadoop scripts
>> 
>> When is nearly now.
>> 
>> Where is everywhere (this is the internet)
>> 
>> How is an open question.  I think that Doug's suggested evolution of Jira
>> with patches -> contrib -> sub-project is appropriate.
>> 
>> 
>> On 2/4/08 2:59 PM, "Khalil Honsali" <k....@gmail.com> wrote:
>> 
>>> Hi all, Mr. Dunning;
>>> 
>>> I am interested in the Groovy idea, especially for processing text, I
>> think
>>> it can be a good opensource alternative to Google's Sawzall.
>>> 
>>> Please let me know the 5-Ws of the matter if possible.
>>> 
>>> K. Honsali
>>> 
>>> On 05/02/2008, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>> 
>>>> sorry, I meant Groovy
>>>> 
>>>> Miles
>>>> 
>>>> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
>>>>> 
>>>>> On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>>>> How stable is the code?  I could quite easily set some undergraduate
>>>>> project
>>>>>> to do something with it, for example process query logs
>>>>>> 
>>>>> 
>>>>> I started learning and using hadoop few days back. The program that I
>>>>> have is similar to word count except that it processes a querylog in
>>>>> special format. I have another program that reads the output of this
>>>>> program and computes the top N keywords. Want to make it a one program
>>>>> (single map reduce)
>>>>> 
>>>>> -Taran
>>>>> 
>>>>>> Miles
>>>>>> 
>>>>>> 
>>>>>> On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> This is a great opportunity for me to talk about the Groovy support
>>>>> that I
>>>>>>> have just gotten running.  I am looking for friendly testers as this
>>>>> code
>>>>>>> is
>>>>>>> definitely not ready for full release.
>>>>>>> 
>>>>>>> The program you need in groovy is this:
>>>>>>> 
>>>>>>> // define the map-reduce function by specifying map and reduce
>>>>> functions
>>>>>>> logCount = Hadoop.mr(
>>>>>>>    {key, value, out, report -> out.collect(value.split[0], 1)},
>>>>>>>    {keyword, counts, out, report ->
>>>>>>>       sum = 0;
>>>>>>>       counts.each { sum += it}
>>>>>>>       out.collect(keyword, sum)
>>>>>>>    })
>>>>>>> 
>>>>>>> // apply the function to an input file and collect the results in a
>>>>> map
>>>>>>> results = [:]
>>>>>>> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
>>>>>>>     line ->
>>>>>>>       parts = line.split(\t)
>>>>>>>       results[parts[0]] = parts[1]
>>>>>>> }
>>>>>>> 
>>>>>>> // sort the entries in the map by descending count and print the
>>>>> results
>>>>>>> for (x in results.entrySet().sort( {-it.value} )) {
>>>>>>>    println x
>>>>>>> }
>>>>>>> 
>>>>>>> // delete the temporary results
>>>>>>> Hadoop.cleanup(results)
>>>>>>> 
>>>>>>> The important points here are:
>>>>>>> 
>>>>>>> 1) the groovy binding lets you express the map-reduce part of your
>>>>> program
>>>>>>> simply.
>>>>>>> 
>>>>>>> 2) collecting the results is trivial ... You don't have to worry
>>>> about
>>>>>>> where
>>>>>>> or how the results are kept.  You would use the same code to read a
>>>>> local
>>>>>>> file as to read the results of the map-reduce computation
>>>>>>> 
>>>>>>> 3) because of (2), you can do some computation locally (the sort)
>>>> and
>>>>> some
>>>>>>> in parallel (the counting).  You could easily translate the sort to
>>>> a
>>>>>>> hadoop
>>>>>>> call as well.
>>>>>>> 
>>>>>>> I know that this doesn't quite answer the question because my
>>>>>>> groovy-hadoop
>>>>>>> bridge isn't available yet, but it hopefully will spark some
>>>> interest.
>>>>>>> 
>>>>>>> The question I would like to pose to the community is this:
>>>>>>> 
>>>>>>>   What is the best way to proceed with code like this that is not
>>>>> ready
>>>>>>> for
>>>>>>> prime time, but is ready for others to contribute and possibly also
>>>>> use?
>>>>>>> Should I follow the Jaql and Cascading course and build a separate
>>>>>>> repository and web site or should I try to add this as a contrib
>>>>> package
>>>>>>> like streaming?  Or should I just hand out source by hand for a
>>>> little
>>>>>>> while
>>>>>>> to get feedback?
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Can someone guide me on how to write program using hadoop
>>>> framework
>>>>>>>> that analyze the log files and find out the top most frequently
>>>>>>>> occurring keywords. The log file has the format -
>>>>>>>> 
>>>>>>>> keyword source dateId
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Tarandeep
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland,
>>>>>> with registration number SC005336.
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>> Scotland,
>>>> with registration number SC005336.
>>>> 
>> 
>> 


Re: hadoop: how to find top N frequently occurring words

Posted by Khalil Honsali <k....@gmail.com>.
Sorry for the lack of clarity,

- I think I understand that Groovy is already usable and stable but
requires some testing? What other things are required?
- What is the next step, i.e. the roadmap, if any? What evolution / growth
direction is planned?
- I haven't tried Pig, but it also seems to support submitting a function to
be transformed to map/reduce, though Pig is higher level?

PS:
 -  maybe Groovy requires another mailinglist thread ...

K. Honsali



On 05/02/2008, Ted Dunning <td...@veoh.com> wrote:
>
>
> Did you mean who, what, when, where and how?
>
> Who is me.  I am the only author so far.
>
> What is a groovy/java program that supports running groovy/hadoop scripts
>
> When is nearly now.
>
> Where is everywhere (this is the internet)
>
> How is an open question.  I think that Doug's suggested evolution of Jira
> with patches -> contrib -> sub-project is appropriate.
>
>
> On 2/4/08 2:59 PM, "Khalil Honsali" <k....@gmail.com> wrote:
>
> > Hi all, Mr. Dunning;
> >
> > I am interested in the Groovy idea, especially for processing text, I
> think
> > it can be a good opensource alternative to Google's Sawzall.
> >
> > Please let me know the 5-Ws of the matter if possible.
> >
> > K. Honsali
> >
> > On 05/02/2008, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> >>
> >> sorry, I meant Groovy
> >>
> >> Miles
> >>
> >> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> >>>
> >>> On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> >>>> How stable is the code?  I could quite easily set some undergraduate
> >>> project
> >>>> to do something with it, for example process query logs
> >>>>
> >>>
> >>> I started learning and using hadoop few days back. The program that I
> >>> have is similar to word count except that it processes a querylog in
> >>> special format. I have another program that reads the output of this
> >>> program and computes the top N keywords. Want to make it a one program
> >>> (single map reduce)
> >>>
> >>> -Taran
> >>>
> >>>> Miles
> >>>>
> >>>>
> >>>> On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
> >>>>>
> >>>>>
> >>>>> This is a great opportunity for me to talk about the Groovy support
> >>> that I
> >>>>> have just gotten running.  I am looking for friendly testers as this
> >>> code
> >>>>> is
> >>>>> definitely not ready for full release.
> >>>>>
> >>>>> The program you need in groovy is this:
> >>>>>
> >>>>> // define the map-reduce function by specifying map and reduce
> >>> functions
> >>>>> logCount = Hadoop.mr(
> >>>>>    {key, value, out, report -> out.collect(value.split[0], 1)},
> >>>>>    {keyword, counts, out, report ->
> >>>>>       sum = 0;
> >>>>>       counts.each { sum += it}
> >>>>>       out.collect(keyword, sum)
> >>>>>    })
> >>>>>
> >>>>> // apply the function to an input file and collect the results in a
> >>> map
> >>>>> results = [:]
> >>>>> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
> >>>>>     line ->
> >>>>>       parts = line.split(\t)
> >>>>>       results[parts[0]] = parts[1]
> >>>>> }
> >>>>>
> >>>>> // sort the entries in the map by descending count and print the
> >>> results
> >>>>> for (x in results.entrySet().sort( {-it.value} )) {
> >>>>>    println x
> >>>>> }
> >>>>>
> >>>>> // delete the temporary results
> >>>>> Hadoop.cleanup(results)
> >>>>>
> >>>>> The important points here are:
> >>>>>
> >>>>> 1) the groovy binding lets you express the map-reduce part of your
> >>> program
> >>>>> simply.
> >>>>>
> >>>>> 2) collecting the results is trivial ... You don't have to worry
> >> about
> >>>>> where
> >>>>> or how the results are kept.  You would use the same code to read a
> >>> local
> >>>>> file as to read the results of the map-reduce computation
> >>>>>
> >>>>> 3) because of (2), you can do some computation locally (the sort)
> >> and
> >>> some
> >>>>> in parallel (the counting).  You could easily translate the sort to
> >> a
> >>>>> hadoop
> >>>>> call as well.
> >>>>>
> >>>>> I know that this doesn't quite answer the question because my
> >>>>> groovy-hadoop
> >>>>> bridge isn't available yet, but it hopefully will spark some
> >> interest.
> >>>>>
> >>>>> The question I would like to pose to the community is this:
> >>>>>
> >>>>>   What is the best way to proceed with code like this that is not
> >>> ready
> >>>>> for
> >>>>> prime time, but is ready for others to contribute and possibly also
> >>> use?
> >>>>> Should I follow the Jaql and Cascading course and build a separate
> >>>>> repository and web site or should I try to add this as a contrib
> >>> package
> >>>>> like streaming?  Or should I just hand out source by hand for a
> >> little
> >>>>> while
> >>>>> to get feedback?
> >>>>>
> >>>>>
> >>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Can someone guide me on how to write program using hadoop
> >> framework
> >>>>>> that analyze the log files and find out the top most frequently
> >>>>>> occurring keywords. The log file has the format -
> >>>>>>
> >>>>>> keyword source dateId
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Tarandeep
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> The University of Edinburgh is a charitable body, registered in
> >>> Scotland,
> >>>> with registration number SC005336.
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> The University of Edinburgh is a charitable body, registered in
> Scotland,
> >> with registration number SC005336.
> >>
>
>

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.
Did you mean who, what, when, where and how?

Who is me.  I am the only author so far.

What is a groovy/java program that supports running groovy/hadoop scripts

When is nearly now.

Where is everywhere (this is the internet)

How is an open question.  I think that Doug's suggested evolution of Jira
with patches -> contrib -> sub-project is appropriate.


On 2/4/08 2:59 PM, "Khalil Honsali" <k....@gmail.com> wrote:

> Hi all, Mr. Dunning;
> 
> I am interested in the Groovy idea, especially for processing text, I think
> it can be a good opensource alternative to Google's Sawzall.
> 
> Please let me know the 5-Ws of the matter if possible.
> 
> K. Honsali
> 
> On 05/02/2008, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>> 
>> sorry, I meant Groovy
>> 
>> Miles
>> 
>> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
>>> 
>>> On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>> How stable is the code?  I could quite easily set some undergraduate
>>> project
>>>> to do something with it, for example process query logs
>>>> 
>>> 
>>> I started learning and using hadoop few days back. The program that I
>>> have is similar to word count except that it processes a querylog in
>>> special format. I have another program that reads the output of this
>>> program and computes the top N keywords. Want to make it a one program
>>> (single map reduce)
>>> 
>>> -Taran
>>> 
>>>> Miles
>>>> 
>>>> 
>>>> On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
>>>>> 
>>>>> 
>>>>> This is a great opportunity for me to talk about the Groovy support
>>> that I
>>>>> have just gotten running.  I am looking for friendly testers as this
>>> code
>>>>> is
>>>>> definitely not ready for full release.
>>>>> 
>>>>> The program you need in groovy is this:
>>>>> 
>>>>> // define the map-reduce function by specifying map and reduce
>>> functions
>>>>> logCount = Hadoop.mr(
>>>>>    {key, value, out, report -> out.collect(value.split[0], 1)},
>>>>>    {keyword, counts, out, report ->
>>>>>       sum = 0;
>>>>>       counts.each { sum += it}
>>>>>       out.collect(keyword, sum)
>>>>>    })
>>>>> 
>>>>> // apply the function to an input file and collect the results in a
>>> map
>>>>> results = [:]
>>>>> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
>>>>>     line ->
>>>>>       parts = line.split(\t)
>>>>>       results[parts[0]] = parts[1]
>>>>> }
>>>>> 
>>>>> // sort the entries in the map by descending count and print the
>>> results
>>>>> for (x in results.entrySet().sort( {-it.value} )) {
>>>>>    println x
>>>>> }
>>>>> 
>>>>> // delete the temporary results
>>>>> Hadoop.cleanup(results)
>>>>> 
>>>>> The important points here are:
>>>>> 
>>>>> 1) the groovy binding lets you express the map-reduce part of your
>>> program
>>>>> simply.
>>>>> 
>>>>> 2) collecting the results is trivial ... You don't have to worry
>> about
>>>>> where
>>>>> or how the results are kept.  You would use the same code to read a
>>> local
>>>>> file as to read the results of the map-reduce computation
>>>>> 
>>>>> 3) because of (2), you can do some computation locally (the sort)
>> and
>>> some
>>>>> in parallel (the counting).  You could easily translate the sort to
>> a
>>>>> hadoop
>>>>> call as well.
>>>>> 
>>>>> I know that this doesn't quite answer the question because my
>>>>> groovy-hadoop
>>>>> bridge isn't available yet, but it hopefully will spark some
>> interest.
>>>>> 
>>>>> The question I would like to pose to the community is this:
>>>>> 
>>>>>   What is the best way to proceed with code like this that is not
>>> ready
>>>>> for
>>>>> prime time, but is ready for others to contribute and possibly also
>>> use?
>>>>> Should I follow the Jaql and Cascading course and build a separate
>>>>> repository and web site or should I try to add this as a contrib
>>> package
>>>>> like streaming?  Or should I just hand out source by hand for a
>> little
>>>>> while
>>>>> to get feedback?
>>>>> 
>>>>> 
>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Can someone guide me on how to write program using hadoop
>> framework
>>>>>> that analyze the log files and find out the top most frequently
>>>>>> occurring keywords. The log file has the format -
>>>>>> 
>>>>>> keyword source dateId
>>>>>> 
>>>>>> Thanks,
>>>>>> Tarandeep
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> 
>>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland,
>>>> with registration number SC005336.
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>> 


Re: hadoop: how to find top N frequently occurring words

Posted by Khalil Honsali <k....@gmail.com>.
Hi all, Mr. Dunning;

I am interested in the Groovy idea, especially for processing text, I think
it can be a good opensource alternative to Google's Sawzall.

Please let me know the 5-Ws of the matter if possible.

K. Honsali

On 05/02/2008, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>
> sorry, I meant Groovy
>
> Miles
>
> On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
> >
> > On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> > > How stable is the code?  I could quite easily set some undergraduate
> > project
> > > to do something with it, for example process query logs
> > >
> >
> > I started learning and using hadoop few days back. The program that I
> > have is similar to word count except that it processes a querylog in
> > special format. I have another program that reads the output of this
> > program and computes the top N keywords. Want to make it a one program
> > (single map reduce)
> >
> > -Taran
> >
> > > Miles
> > >
> > >
> > > On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
> > > >
> > > >
> > > > This is a great opportunity for me to talk about the Groovy support
> > that I
> > > > have just gotten running.  I am looking for friendly testers as this
> > code
> > > > is
> > > > definitely not ready for full release.
> > > >
> > > > The program you need in groovy is this:
> > > >
> > > > // define the map-reduce function by specifying map and reduce
> > functions
> > > > logCount = Hadoop.mr(
> > > >    {key, value, out, report -> out.collect(value.split[0], 1)},
> > > >    {keyword, counts, out, report ->
> > > >       sum = 0;
> > > >       counts.each { sum += it}
> > > >       out.collect(keyword, sum)
> > > >    })
> > > >
> > > > // apply the function to an input file and collect the results in a
> > map
> > > > results = [:]
> > > > LogCount(inputFileEitherLocallyOnHDFS).eachLine {
> > > >     line ->
> > > >       parts = line.split(\t)
> > > >       results[parts[0]] = parts[1]
> > > > }
> > > >
> > > > // sort the entries in the map by descending count and print the
> > results
> > > > for (x in results.entrySet().sort( {-it.value} )) {
> > > >    println x
> > > > }
> > > >
> > > > // delete the temporary results
> > > > Hadoop.cleanup(results)
> > > >
> > > > The important points here are:
> > > >
> > > > 1) the groovy binding lets you express the map-reduce part of your
> > program
> > > > simply.
> > > >
> > > > 2) collecting the results is trivial ... You don't have to worry
> about
> > > > where
> > > > or how the results are kept.  You would use the same code to read a
> > local
> > > > file as to read the results of the map-reduce computation
> > > >
> > > > 3) because of (2), you can do some computation locally (the sort)
> and
> > some
> > > > in parallel (the counting).  You could easily translate the sort to
> a
> > > > hadoop
> > > > call as well.
> > > >
> > > > I know that this doesn't quite answer the question because my
> > > > groovy-hadoop
> > > > bridge isn't available yet, but it hopefully will spark some
> interest.
> > > >
> > > > The question I would like to pose to the community is this:
> > > >
> > > >   What is the best way to proceed with code like this that is not
> > ready
> > > > for
> > > > prime time, but is ready for others to contribute and possibly also
> > use?
> > > > Should I follow the Jaql and Cascading course and build a separate
> > > > repository and web site or should I try to add this as a contrib
> > package
> > > > like streaming?  Or should I just hand out source by hand for a
> little
> > > > while
> > > > to get feedback?
> > > >
> > > >
> > > > On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Can someone guide me on how to write program using hadoop
> framework
> > > > > that analyze the log files and find out the top most frequently
> > > > > occurring keywords. The log file has the format -
> > > > >
> > > > > keyword source dateId
> > > > >
> > > > > Thanks,
> > > > > Tarandeep
> > > >
> > > >
> > >
> > >
> > > --
> > >
> > > The University of Edinburgh is a charitable body, registered in
> > Scotland,
> > > with registration number SC005336.
> > >
> >
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>

Re: hadoop: how to find top N frequently occurring words

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
sorry, I meant Groovy

Miles

On 04/02/2008, Tarandeep Singh <ta...@gmail.com> wrote:
>
> On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> > How stable is the code?  I could quite easily set some undergraduate
> project
> > to do something with it, for example process query logs
> >
>
> I started learning and using hadoop few days back. The program that I
> have is similar to word count except that it processes a querylog in
> special format. I have another program that reads the output of this
> program and computes the top N keywords. Want to make it a one program
> (single map reduce)
>
> -Taran
>
> > Miles
> >
> >
> > On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
> > >
> > >
> > > This is a great opportunity for me to talk about the Groovy support
> that I
> > > have just gotten running.  I am looking for friendly testers as this
> code
> > > is
> > > definitely not ready for full release.
> > >
> > > The program you need in groovy is this:
> > >
> > > // define the map-reduce function by specifying map and reduce
> functions
> > > logCount = Hadoop.mr(
> > >    {key, value, out, report -> out.collect(value.split[0], 1)},
> > >    {keyword, counts, out, report ->
> > >       sum = 0;
> > >       counts.each { sum += it}
> > >       out.collect(keyword, sum)
> > >    })
> > >
> > > // apply the function to an input file and collect the results in a
> map
> > > results = [:]
> > > LogCount(inputFileEitherLocallyOnHDFS).eachLine {
> > >     line ->
> > >       parts = line.split(\t)
> > >       results[parts[0]] = parts[1]
> > > }
> > >
> > > // sort the entries in the map by descending count and print the
> results
> > > for (x in results.entrySet().sort( {-it.value} )) {
> > >    println x
> > > }
> > >
> > > // delete the temporary results
> > > Hadoop.cleanup(results)
> > >
> > > The important points here are:
> > >
> > > 1) the groovy binding lets you express the map-reduce part of your
> program
> > > simply.
> > >
> > > 2) collecting the results is trivial ... You don't have to worry about
> > > where
> > > or how the results are kept.  You would use the same code to read a
> local
> > > file as to read the results of the map-reduce computation
> > >
> > > 3) because of (2), you can do some computation locally (the sort) and
> some
> > > in parallel (the counting).  You could easily translate the sort to a
> > > hadoop
> > > call as well.
> > >
> > > I know that this doesn't quite answer the question because my
> > > groovy-hadoop
> > > bridge isn't available yet, but it hopefully will spark some interest.
> > >
> > > The question I would like to pose to the community is this:
> > >
> > >   What is the best way to proceed with code like this that is not
> ready
> > > for
> > > prime time, but is ready for others to contribute and possibly also
> use?
> > > Should I follow the Jaql and Cascading course and build a separate
> > > repository and web site or should I try to add this as a contrib
> package
> > > like streaming?  Or should I just hand out source by hand for a little
> > > while
> > > to get feedback?
> > >
> > >
> > > On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Can someone guide me on how to write program using hadoop framework
> > > > that analyze the log files and find out the top most frequently
> > > > occurring keywords. The log file has the format -
> > > >
> > > > keyword source dateId
> > > >
> > > > Thanks,
> > > > Tarandeep
> > >
> > >
> >
> >
> > --
> >
> > The University of Edinburgh is a charitable body, registered in
> Scotland,
> > with registration number SC005336.
> >
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: hadoop: how to find top N frequently occurring words

Posted by Tarandeep Singh <ta...@gmail.com>.
On Feb 4, 2008 2:40 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> How stable is the code?  I could quite easily set some undergraduate project
> to do something with it, for example process query logs
>

I started learning and using Hadoop a few days back. The program that I
have is similar to word count, except that it processes a query log in a
special format. I have another program that reads the output of this
program and computes the top N keywords. I want to make it one program
(a single map-reduce), roughly along the lines of the sketch below.
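
Roughly what I have in mind for the counting half, written as a plain
Java sketch against the old mapred API (the class name, job name and
paths below are placeholders, the log fields are assumed to be
whitespace-separated, and the exact JobConf setters vary between Hadoop
releases):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class KeywordCount {

  // Emits (keyword, 1) for every "keyword source dateId" log line
  public static class KeywordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text keyword = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out,
                    Reporter reporter) throws IOException {
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 0 && fields[0].length() > 0) {
        keyword.set(fields[0]);
        out.collect(keyword, ONE);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(KeywordCount.class);
    conf.setJobName("keyword-count");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(KeywordMapper.class);
    conf.setCombinerClass(LongSumReducer.class);  // pre-sum counts on the map side
    conf.setReducerClass(LongSumReducer.class);   // final per-keyword totals

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
  }
}

That gives one "keyword<TAB>total" line per keyword; the second program
then ranks those totals, and that ranking is the part I would like to
pull into the same job.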

-Taran

> Miles
>
>
> On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
> >
> >
> > This is a great opportunity for me to talk about the Groovy support that I
> > have just gotten running.  I am looking for friendly testers as this code
> > is
> > definitely not ready for full release.
> >
> > The program you need in groovy is this:
> >
> > // define the map-reduce function by specifying map and reduce functions
> > logCount = Hadoop.mr(
> >    {key, value, out, report -> out.collect(value.split[0], 1)},
> >    {keyword, counts, out, report ->
> >       sum = 0;
> >       counts.each { sum += it}
> >       out.collect(keyword, sum)
> >    })
> >
> > // apply the function to an input file and collect the results in a map
> > results = [:]
> > LogCount(inputFileEitherLocallyOnHDFS).eachLine {
> >     line ->
> >       parts = line.split(\t)
> >       results[parts[0]] = parts[1]
> > }
> >
> > // sort the entries in the map by descending count and print the results
> > for (x in results.entrySet().sort( {-it.value} )) {
> >    println x
> > }
> >
> > // delete the temporary results
> > Hadoop.cleanup(results)
> >
> > The important points here are:
> >
> > 1) the groovy binding lets you express the map-reduce part of your program
> > simply.
> >
> > 2) collecting the results is trivial ... You don't have to worry about
> > where
> > or how the results are kept.  You would use the same code to read a local
> > file as to read the results of the map-reduce computation
> >
> > 3) because of (2), you can do some computation locally (the sort) and some
> > in parallel (the counting).  You could easily translate the sort to a
> > hadoop
> > call as well.
> >
> > I know that this doesn't quite answer the question because my
> > groovy-hadoop
> > bridge isn't available yet, but it hopefully will spark some interest.
> >
> > The question I would like to pose to the community is this:
> >
> >   What is the best way to proceed with code like this that is not ready
> > for
> > prime time, but is ready for others to contribute and possibly also use?
> > Should I follow the Jaql and Cascading course and build a separate
> > repository and web site or should I try to add this as a contrib package
> > like streaming?  Or should I just hand out source by hand for a little
> > while
> > to get feedback?
> >
> >
> > On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Can someone guide me on how to write program using hadoop framework
> > > that analyze the log files and find out the top most frequently
> > > occurring keywords. The log file has the format -
> > >
> > > keyword source dateId
> > >
> > > Thanks,
> > > Tarandeep
> >
> >
>
>
> --
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>

Re: hadoop: how to find top N frequently occurring words

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
How stable is the code?  I could quite easily set some undergraduate
project to do something with it, for example processing query logs.

Miles

On 04/02/2008, Ted Dunning <td...@veoh.com> wrote:
>
>
> This is a great opportunity for me to talk about the Groovy support that I
> have just gotten running.  I am looking for friendly testers as this code
> is
> definitely not ready for full release.
>
> The program you need in groovy is this:
>
> // define the map-reduce function by specifying map and reduce functions
> logCount = Hadoop.mr(
>    {key, value, out, report -> out.collect(value.split[0], 1)},
>    {keyword, counts, out, report ->
>       sum = 0;
>       counts.each { sum += it}
>       out.collect(keyword, sum)
>    })
>
> // apply the function to an input file and collect the results in a map
> results = [:]
> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
>     line ->
>       parts = line.split(\t)
>       results[parts[0]] = parts[1]
> }
>
> // sort the entries in the map by descending count and print the results
> for (x in results.entrySet().sort( {-it.value} )) {
>    println x
> }
>
> // delete the temporary results
> Hadoop.cleanup(results)
>
> The important points here are:
>
> 1) the groovy binding lets you express the map-reduce part of your program
> simply.
>
> 2) collecting the results is trivial ... You don't have to worry about
> where
> or how the results are kept.  You would use the same code to read a local
> file as to read the results of the map-reduce computation
>
> 3) because of (2), you can do some computation locally (the sort) and some
> in parallel (the counting).  You could easily translate the sort to a
> hadoop
> call as well.
>
> I know that this doesn't quite answer the question because my
> groovy-hadoop
> bridge isn't available yet, but it hopefully will spark some interest.
>
> The question I would like to pose to the community is this:
>
>   What is the best way to proceed with code like this that is not ready
> for
> prime time, but is ready for others to contribute and possibly also use?
> Should I follow the Jaql and Cascading course and build a separate
> repository and web site or should I try to add this as a contrib package
> like streaming?  Or should I just hand out source by hand for a little
> while
> to get feedback?
>
>
> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>
> > Hi,
> >
> > Can someone guide me on how to write program using hadoop framework
> > that analyze the log files and find out the top most frequently
> > occurring keywords. The log file has the format -
> >
> > keyword source dateId
> >
> > Thanks,
> > Tarandeep
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.

I have created a Jira for the Groovy integration.

https://issues.apache.org/jira/browse/HADOOP-2781

As soon as I can clear the licenses, I will post the code.


On 2/4/08 4:06 PM, "Colin Evans" <co...@metaweb.com> wrote:

> Hi Ted,
> I've been building out a similar framework in JavaScript (Rhino) for
> work that I've been doing at MetaWeb, and we've been thinking about open
> sourcing it too.  It's pretty clear that there are major benefits to
> using a dynamic scripting language with Hadoop.
> 
> I'd love too see how you're tackled this problem and would be interested
> in contributing work to this too.
> 
> -Colin
> 
> 
> 
> Ted Dunning wrote:
>> This is a great opportunity for me to talk about the Groovy support that I
>> have just gotten running.  I am looking for friendly testers as this code is
>> definitely not ready for full release.
>> 
>> The program you need in groovy is this:
>> 
>> // define the map-reduce function by specifying map and reduce functions
>> logCount = Hadoop.mr(
>>    {key, value, out, report -> out.collect(value.split[0], 1)},
>>    {keyword, counts, out, report ->
>>       sum = 0; 
>>       counts.each { sum += it}
>>       out.collect(keyword, sum)
>>    })
>> 
>> // apply the function to an input file and collect the results in a map
>> results = [:]
>> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
>>     line ->
>>       parts = line.split(\t)
>>       results[parts[0]] = parts[1]
>> }
>> 
>> // sort the entries in the map by descending count and print the results
>> for (x in results.entrySet().sort( {-it.value} )) {
>>    println x
>> }
>> 
>> // delete the temporary results
>> Hadoop.cleanup(results)
>> 
>> The important points here are:
>> 
>> 1) the groovy binding lets you express the map-reduce part of your program
>> simply.
>> 
>> 2) collecting the results is trivial ... You don't have to worry about where
>> or how the results are kept.  You would use the same code to read a local
>> file as to read the results of the map-reduce computation
>> 
>> 3) because of (2), you can do some computation locally (the sort) and some
>> in parallel (the counting).  You could easily translate the sort to a hadoop
>> call as well.
>> 
>> I know that this doesn't quite answer the question because my groovy-hadoop
>> bridge isn't available yet, but it hopefully will spark some interest.
>> 
>> The question I would like to pose to the community is this:
>> 
>>   What is the best way to proceed with code like this that is not ready for
>> prime time, but is ready for others to contribute and possibly also use?
>> Should I follow the Jaql and Cascading course and build a separate
>> repository and web site or should I try to add this as a contrib package
>> like streaming?  Or should I just hand out source by hand for a little while
>> to get feedback?
>> 
>> 
>> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>> 
>>   
>>> Hi,
>>> 
>>> Can someone guide me on how to write program using hadoop framework
>>> that analyze the log files and find out the top most frequently
>>> occurring keywords. The log file has the format -
>>> 
>>> keyword source dateId
>>> 
>>> Thanks,
>>> Tarandeep
>>>     
>> 
>>   
> 


Re: hadoop: how to find top N frequently occurring words

Posted by Colin Evans <co...@metaweb.com>.
Hi Ted,
I've been building out a similar framework in JavaScript (Rhino) for 
work that I've been doing at MetaWeb, and we've been thinking about open 
sourcing it too.  It's pretty clear that there are major benefits to 
using a dynamic scripting language with Hadoop. 

I'd love to see how you've tackled this problem and would be interested
in contributing work to this too.

-Colin



Ted Dunning wrote:
> This is a great opportunity for me to talk about the Groovy support that I
> have just gotten running.  I am looking for friendly testers as this code is
> definitely not ready for full release.
>
> The program you need in groovy is this:
>
> // define the map-reduce function by specifying map and reduce functions
> logCount = Hadoop.mr(
>    {key, value, out, report -> out.collect(value.split[0], 1)},
>    {keyword, counts, out, report ->
>       sum = 0; 
>       counts.each { sum += it}
>       out.collect(keyword, sum)
>    })
>
> // apply the function to an input file and collect the results in a map
> results = [:]
> LogCount(inputFileEitherLocallyOnHDFS).eachLine {
>     line ->
>       parts = line.split(\t)
>       results[parts[0]] = parts[1]
> }
>
> // sort the entries in the map by descending count and print the results
> for (x in results.entrySet().sort( {-it.value} )) {
>    println x
> }
>
> // delete the temporary results
> Hadoop.cleanup(results)
>
> The important points here are:
>
> 1) the groovy binding lets you express the map-reduce part of your program
> simply.
>
> 2) collecting the results is trivial ... You don't have to worry about where
> or how the results are kept.  You would use the same code to read a local
> file as to read the results of the map-reduce computation
>
> 3) because of (2), you can do some computation locally (the sort) and some
> in parallel (the counting).  You could easily translate the sort to a hadoop
> call as well.
>
> I know that this doesn't quite answer the question because my groovy-hadoop
> bridge isn't available yet, but it hopefully will spark some interest.
>
> The question I would like to pose to the community is this:
>
>   What is the best way to proceed with code like this that is not ready for
> prime time, but is ready for others to contribute and possibly also use?
> Should I follow the Jaql and Cascading course and build a separate
> repository and web site or should I try to add this as a contrib package
> like streaming?  Or should I just hand out source by hand for a little while
> to get feedback?
>
>
> On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:
>
>   
>> Hi,
>>
>> Can someone guide me on how to write program using hadoop framework
>> that analyze the log files and find out the top most frequently
>> occurring keywords. The log file has the format -
>>
>> keyword source dateId
>>
>> Thanks,
>> Tarandeep
>>     
>
>   


Re: hadoop: how to find top N frequently occurring words

Posted by Doug Cutting <cu...@apache.org>.
Ted Dunning wrote:
> The question I would like to pose to the community is this:
> 
>   What is the best way to proceed with code like this that is not ready for
> prime time, but is ready for others to contribute and possibly also use?
> Should I follow the Jaql and Cascading course and build a separate
> repository and web site or should I try to add this as a contrib package
> like streaming?  Or should I just hand out source by hand for a little while
> to get feedback?

In part it depends on how actively you think it will be developed and by 
how many developers.

If you think it'll mostly just be you, plus an occasional patch from 
others, then attaching it to a Jira issue with the intent of maintaining 
it in contrib would be reasonable.  After a few patches, we could make 
you a contrib committer so that you can maintain it there.

However if you think it will be actively maintained by a more sizable 
group of developers, then a separate repo with independent release 
cycles could be more workable.  This could be a Hadoop subproject, or a 
project hosted outside Apache.  Apache imposes more overhead on 
projects, especially  on startup, than places like Sourceforge or Google 
Code, but correspondingly provides more guarantees about IP and 
community.  You could start a project outside of Apache and move it in, 
but then it has to go through the Incubator.  Or you could start 
something as a contrib module within Hadoop Core, and, once it reaches a 
critical mass, try to promote it to a separate subproject.

Lots of possibilities...

Doug

Re: hadoop: how to find top N frequently occurring words

Posted by Ted Dunning <td...@veoh.com>.
This is a great opportunity for me to talk about the Groovy support that I
have just gotten running.  I am looking for friendly testers as this code is
definitely not ready for full release.

The program you need in groovy is this:

// define the map-reduce function by specifying map and reduce functions
logCount = Hadoop.mr(
   {key, value, out, report -> out.collect(value.split(/\s+/)[0], 1)},
   {keyword, counts, out, report ->
      sum = 0; 
      counts.each { sum += it}
      out.collect(keyword, sum)
   })

// apply the function to an input file and collect the results in a map
results = [:]
logCount(inputFileEitherLocallyOnHDFS).eachLine {
    line ->
      parts = line.split("\t")
      results[parts[0]] = parts[1].toInteger()
}

// sort the entries in the map by descending count and print the results
for (x in results.entrySet().sort( {-it.value} )) {
   println x
}

// delete the temporary results
Hadoop.cleanup(results)

The important points here are:

1) the groovy binding lets you express the map-reduce part of your program
simply.

2) collecting the results is trivial ... You don't have to worry about where
or how the results are kept.  You would use the same code to read a local
file as to read the results of the map-reduce computation.
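
That works because Hadoop's FileSystem API picks the right
implementation from the path itself. A bare-bones Java sketch of
reading a result file that way (the path comes from the command line
and can be a local file or an HDFS part file):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadResults {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path(args[0]);              // e.g. a part-00000 file or a local path
    FileSystem fs = p.getFileSystem(conf);   // local FS or HDFS, chosen from the URI
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(p)));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);              // each line is "keyword<TAB>count"
    }
    in.close();
  }
}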

3) because of (2), you can do some computation locally (the sort) and some
in parallel (the counting).  You could easily translate the sort to a hadoop
call as well.
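
For instance, the sort could itself be a second job, much as the grep
example that ships with Hadoop does for its match counts: flip each
"keyword<TAB>count" line to (count, keyword), sort the counts in
decreasing order, and funnel everything through a single reduce task.
A rough Java sketch (class names and paths are invented, and the
JobConf setters are version-dependent):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SortByCount {

  // Turns a "keyword<TAB>count" line from the counting job into (count, keyword)
  public static class FlipMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<LongWritable, Text> out,
                    Reporter reporter) throws IOException {
      String[] parts = line.toString().split("\t");
      if (parts.length == 2) {
        out.collect(new LongWritable(Long.parseLong(parts[1])),
                    new Text(parts[0]));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SortByCount.class);
    conf.setJobName("sort-by-count");

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(FlipMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(1);  // one reducer, so the output is globally sorted

    // sort the counts in decreasing order, so the biggest keywords come first
    conf.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);

    conf.setInputPath(new Path(args[0]));   // output directory of the counting job
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
  }
}

With the counts in the key, the framework's own sort does the ranking,
and the first N lines of the single output file are the top N keywords.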

I know that this doesn't quite answer the question because my groovy-hadoop
bridge isn't available yet, but it hopefully will spark some interest.

The question I would like to pose to the community is this:

  What is the best way to proceed with code like this that is not ready for
prime time, but is ready for others to contribute and possibly also use?
Should I follow the Jaql and Cascading course and build a separate
repository and web site or should I try to add this as a contrib package
like streaming?  Or should I just hand out source by hand for a little while
to get feedback?


On 2/4/08 2:04 PM, "Tarandeep Singh" <ta...@gmail.com> wrote:

> Hi,
> 
> Can someone guide me on how to write program using hadoop framework
> that analyze the log files and find out the top most frequently
> occurring keywords. The log file has the format -
> 
> keyword source dateId
> 
> Thanks,
> Tarandeep