Posted to common-user@hadoop.apache.org by parnab kumar <pa...@gmail.com> on 2014/06/20 20:51:12 UTC

grouping similar items together

Hi,

    I have a set of hashes. Each hash is a 32-bit integer (held in a Long).
Two hashes are similar if their Hamming distance is less than or equal to 2.

I need to group together hashes that are mutually similar to one another,
i.e. each line of the output file should contain mutually similar keys.

I implemented a custom Writable, and its compareTo method looks as
follows:

    public int compareTo(Object o) {
        Long thisHash = this.hash;
        Long thatHash = ((DocumentHash) o).hash;
        // Treat hashes within Hamming distance 2 as equal.
        if (hammingDist(thisHash, thatHash) <= 2) {
            return 0;
        }
        return thisHash.compareTo(thatHash);
    }
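
The hammingDist helper is not shown in the post. A minimal sketch, assuming it simply counts the bit positions in which two long values differ, could look like the following. The class name HammingSketch is hypothetical; the two values come from the example further down.

```java
// Hypothetical helper: the post calls hammingDist but does not show its body.
final class HammingSketch {

    // Number of bit positions in which a and b differ.
    static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // The two distinct hash values from the example in the post.
        System.out.println(hammingDist(69215512L, 69215568L)); // prints 2
    }
}
```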


In the map function I emit the custom Writable as the key, and in the reduce
I group by the keys.

I checked the output file, exhaustively tested the hashes manually, and
found that on most lines the hashes are mutually similar. However, I found
that some hashes, even though they are similar to a group, are not on that
group's output line.

For example, consider the following hashes:

HASH1 = 69215512
HASH2 = 69215512
HASH3 = 69215512
HASH4 = 69215568

All 4 hashes above are mutually similar and within distance 2 of each
other. Still, in the output file I found two separate records, with HASH1
and HASH2 on one line and HASH3 and HASH4 on another, as follows:

HASH4    HASH3
HASH1    HASH2


Can someone explain why this happens?


Thanks,
Parnab.

Re: grouping similar items together

Posted by Stanley Shi <ss...@gopivotal.com>.
The "similar" relation is not transitive. That is, a may be similar to b and
b similar to c while a is not similar to c.
So how do you define the groups?

Regards,
*Stanley Shi,*
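
The non-transitivity described above can be checked with three small values. This is only an illustrative sketch: the class name TransitivityDemo and the method similarCompare are hypothetical, and similarCompare mirrors the compareTo logic from the question on raw long values rather than on the DocumentHash writable.

```java
// Sketch: shows that "Hamming distance <= 2" is not a transitive relation.
final class TransitivityDemo {

    // Number of differing bits between a and b.
    static int hammingDist(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Same rule as the compareTo in the question, applied to plain longs.
    static int similarCompare(long x, long y) {
        if (hammingDist(x, y) <= 2) {
            return 0;
        }
        return Long.compare(x, y);
    }

    public static void main(String[] args) {
        long a = 0b0000L, b = 0b0011L, c = 0b1111L;
        System.out.println(similarCompare(a, b)); // 0: a and b differ in 2 bits
        System.out.println(similarCompare(b, c)); // 0: b and c differ in 2 bits
        System.out.println(similarCompare(a, c)); // non-zero: a and c differ in 4 bits
    }
}
```

Here a "equals" b and b "equals" c, yet a does not "equal" c, so the method does not satisfy the Comparable contract. Sorting and grouping keys with such a comparator can give results that depend on the order in which keys happen to be compared.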




Re: grouping similar items together

Posted by Chris Mawata <ch...@gmail.com>.
1. We can't see your reduce algorithm so we can't tell you why the 'group'
you think should work is not working.
2. The relation you have is not transitive so you will not have equivalence
classes.
Chris
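
Since the relation yields no equivalence classes, any grouping has to choose its own definition of a group. One possible choice, which is not proposed anywhere in this thread, is to take the transitive closure: put two hashes in the same group whenever a chain of similar hashes connects them. The sketch below does this in memory with a union-find structure. The class name SimilarityGroups and its methods are hypothetical, and the pairwise loop is quadratic, so it only illustrates the idea for small inputs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: groups hashes into connected components of the "similar" relation.
final class SimilarityGroups {

    // Finds the representative of i's set, shortening the path as it goes.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Hashes linked by a chain of pairs within maxDist end up in one group.
    static Collection<List<Long>> group(long[] hashes, int maxDist) {
        int n = hashes.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (Long.bitCount(hashes[i] ^ hashes[j]) <= maxDist) {
                    parent[find(parent, i)] = find(parent, j); // merge the two sets
                }
            }
        }
        Map<Integer, List<Long>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(hashes[i]);
        }
        return groups.values();
    }

    public static void main(String[] args) {
        // The four hashes from the question, plus 0 as an unrelated value.
        long[] hashes = {69215512L, 69215512L, 69215512L, 69215568L, 0L};
        System.out.println(group(hashes, 2));
    }
}
```

Note the trade-off: with this definition, two hashes in the same group may be more than 2 bits apart, because only the links along the chain are guaranteed to be within the threshold. If every pair in a group must be within distance 2, a different definition of a group is needed.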
