Posted to hdfs-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/01/15 00:44:25 UTC

probably very stupid question

Hi,
  Probably a very lame question.
I have two documents, and I want to find the overlap of the two documents in
a MapReduce fashion and then compare the overlap (let's say I have some
measure for doing that).

So this is what I am thinking:

    1) Run the normal wordcount job on one document (
https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-times-a-word-appeared-in-a-file-using-map-reduce-framework
)
    2) But rather than saving to a file, save everything in a
HashMap(word, true)
    3) Pass that HashMap along to the second wordcount MapReduce program,
and then, as I am processing the second document, check each word against
the HashMap to find whether it is present or not.

So, something like this:

     1) HashMap<String, Boolean> hm = runStepOne(); <-- MapReduce job
     2) runStepTwo(hm); <-- second MapReduce job that takes the HashMap
How do I do this in Hadoop?
I know there can be other hacks, but what I am trying to achieve is to get
comfortable with the Java framework.
So, starting from the above link, how do I save the data structure instead
of a file, and how do I pass the data structure as an argument?
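
For illustration, a minimal sketch of how the "pass a HashMap" idea in
steps 1-3 is usually realized in plain Hadoop: the first word-count job
writes its output to HDFS, the driver of the second job registers that
output with the DistributedCache, and the second job's mapper loads it
into a HashMap in setup(). Class names and the cache-file layout below are
invented for this example, and it targets the Hadoop 1.x API current at
the time of this thread:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for the second document: emits a word only if it also occurred
// in the first document, whose word list was shipped via DistributedCache.
// In the driver of the second job you would first call something like
//   DistributedCache.addCacheFile(firstJobOutputUri, job.getConfiguration());
public class OverlapMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final HashMap<String, Boolean> firstDocWords = new HashMap<String, Boolean>();
  private final static IntWritable ONE = new IntWritable(1);
  private final Text outWord = new Text();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Local copies of every file that was added to the distributed cache.
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    if (cached == null) {
      return;
    }
    for (Path p : cached) {
      BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          // Word-count output is "word<TAB>count"; only the word matters here.
          firstDocWords.put(line.split("\t")[0], Boolean.TRUE);
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      String word = tokens.nextToken();
      if (firstDocWords.containsKey(word)) {
        outWord.set(word);
        context.write(outWord, ONE); // word appears in both documents
      }
    }
  }
}

This only works comfortably when the first document's word list fits in
each mapper's memory; the reduce-side join suggested in the reply below
avoids that assumption.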

Re: probably very stupid question

Posted by be...@gmail.com.
Hi Jamal

I believe a reduce-side join is what you are looking for.

You can use MultipleInputs to perform a reduce-side join for this.

http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html

Regards 
Bejoy KS

Sent from a remote device; please excuse typos.
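
For illustration, a minimal sketch of the reduce-side join Bejoy
describes: each document gets its own mapper via MultipleInputs, the
mappers tag every word with the document it came from, and the reducer
keeps only the words that carry both tags. The class names and the
three-argument command line are invented for this example and are not
taken from the linked post:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OverlapJoin {

  // Emits (word, "A") for every word in document A.
  public static class DocAMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text tag = new Text("A");
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, tag);
      }
    }
  }

  // Emits (word, "B") for every word in document B.
  public static class DocBMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text tag = new Text("B");
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, tag);
      }
    }
  }

  // Each word arrives with the tags of the documents it occurred in;
  // it belongs to the overlap only if both tags are present.
  public static class OverlapReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> tags, Context ctx)
        throws IOException, InterruptedException {
      Set<String> seen = new HashSet<String>();
      for (Text t : tags) {
        seen.add(t.toString());
      }
      if (seen.contains("A") && seen.contains("B")) {
        ctx.write(word, new Text("both"));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = document A, args[1] = document B, args[2] = output directory
    Job job = new Job(new Configuration(), "document overlap");
    job.setJarByClass(OverlapJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, DocAMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, DocBMapper.class);
    job.setReducerClass(OverlapReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This runs both documents through a single job; if the comparison measure
also needs the word counts, the mappers could emit a count alongside the
tag instead of a plain marker.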
