Posted to mapreduce-user@hadoop.apache.org by João Paulo Forny <jp...@gmail.com> on 2014/02/28 18:37:01 UTC

Reduce side join of similar records

I'm implementing a join between two datasets A and B by a String key, which
is the name attribute. I need to match similar names in this join.

My first thought, given that I was already implementing a secondary sort so
that the values extracted from dataset A arrive before the values from
dataset B, was to create a grouping comparator class that groups values
with a string similarity algorithm instead of using the compareTo method on
the natural key. It has not worked as expected: names that match in my
algorithm were not grouped under the same key. See my code below.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class StringSimilarityGroupingComparator extends WritableComparator {

    protected StringSimilarityGroupingComparator() {
        // true: create key instances so compare() receives deserialized objects
        super(JoinKeyTagPairWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
        StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                StringSimilarityMatcher.NAME_MATCH);

        // Similar names count as equal; otherwise fall back to natural order
        return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey())
                ? 0
                : k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}

This approach makes total sense to me. Where was I mistaken? Isn't this the
purpose of overriding the grouping comparator class?
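
To make the question concrete, here is a minimal sketch of how this kind of
grouping might break down, written as plain Java with no Hadoop dependency.
It rests on two assumptions about the framework: that the reduce side
decides group boundaries by comparing each key only with the key before it
in sort order, and that keys are routed to reducers by a hash of the key, as
the default HashPartitioner does. The class GroupingSketch, the similar()
rule (same length, at most one differing character) and the sample names are
hypothetical stand-ins, not the StringSimilarityMatcher used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GroupingSketch {

    // Hypothetical similarity rule (an assumption for illustration only):
    // equal length and at most one differing character.
    static boolean similar(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff <= 1;
    }

    // Same formula as Hadoop's default HashPartitioner, applied to a String key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Walks the sorted keys and starts a new group whenever the current key
    // does not match the previous one. Only adjacent keys are ever compared.
    static List<List<String>> groupAdjacent(List<String> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String prev = null;
        for (String key : sortedKeys) {
            if (prev == null || !similar(prev, key)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            prev = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Arrays.asList("mario", "marco", "maria"));
        Collections.sort(keys); // natural order, as the sort comparator would do

        System.out.println(groupAdjacent(keys));
        System.out.println("marco ~ mario: " + similar("marco", "mario"));
        System.out.println("partition(maria) = " + partition("maria", 4));
        System.out.println("partition(mario) = " + partition("mario", 4));
    }
}
```

With these names the sketch groups the keys as [[marco], [maria, mario]].
marco and mario match each other, yet they land in different groups, because
maria sorts between them and does not match marco: the match is not
transitive and only neighbours are compared. Independently of that, maria
and mario hash to different partitions with 4 reducers, so under plain hash
partitioning they would not even reach the same reduce task.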