Posted to hdfs-user@hadoop.apache.org by Vasco Visser <va...@gmail.com> on 2013/01/03 01:47:38 UTC

Re: Pairwise Comparison of Large Datasets

Hi Rob,

Thanks for sharing. The approach you take is similar to how Pig
implements the cross product (see the section on cross in:
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html).
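
To make the grid idea concrete, here is a rough single-process sketch in
Python. It is not Pig's actual code, nor the method from your post; the
names grid_cross, rows and cols are made up for illustration, and the
"cells" dictionary stands in for reducers. It assumes each left record is
sent to every cell of one randomly chosen row and each right record to
every cell of one randomly chosen column, so any given pair meets in
exactly one cell.

```python
import random
from collections import defaultdict
from itertools import product


def grid_cross(left, right, rows=4, cols=4, seed=0):
    """Cross product of two lists via a rows x cols grid of cells.

    Hypothetical illustration only: each cell plays the role of a reducer
    that sees just the records routed to it, never the whole dataset.
    """
    rng = random.Random(seed)
    # (row, col) -> (left records, right records) routed to that cell
    cells = defaultdict(lambda: ([], []))

    # "Map" side, left input: pick one random row, copy to every column in it.
    for rec in left:
        r = rng.randrange(rows)
        for c in range(cols):
            cells[(r, c)][0].append(rec)

    # "Map" side, right input: pick one random column, copy to every row in it.
    for rec in right:
        c = rng.randrange(cols)
        for r in range(rows):
            cells[(r, c)][1].append(rec)

    # "Reduce" side: each cell pairs up only the records it received.
    pairs = []
    for lefts, rights in cells.values():
        pairs.extend(product(lefts, rights))
    return pairs


pairs = grid_cross(list(range(5)), list("abc"))
```

A left record in row r and a right record in column c only ever share the
cell (r, c), which is why no pair is produced twice. The price is that
every record is copied rows or cols times.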

You'll probably find this article interesting:
Processing Theta-Joins using MapReduce
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.229.1890&rep=rep1&type=pdf)
It features a similar grid-like approach, but with some smart tricks.

You'll probably also like Jimmy Lin's articles on pairwise similarity in
MapReduce (http://www.umiacs.umd.edu/~jimmylin/publications/index.html).

best, Vasco

On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <ro...@dynamicorange.com> wrote:
> Happy New Year :)
>
> Thought some of you might find this useful.
>
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
>
> http://dynamicorange.com/2012/12/31/pairwise-comparisons-of-large-datasets/
>
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
>
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
>
> thanks
>
> rob