Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2013/08/14 19:06:20 UTC

How do I perform a scalable cartesian product

  I have the problem of performing an operation on a data set against itself.

   Assume, for example, that I have a list of people and their addresses,
and for each person I want the ten closest members of the set. (This is not
the actual problem, but it illustrates the critical aspects.) I know that
the ten closest people will be in the same zip code or a neighboring zip
code. This means that, unless the database is very large, I can have the
mapper send every person out with keys representing their zip code and also
keys representing the neighboring zip codes. In the reducer I can keep all
the people for one key in memory and compute distances between them (assume
the distance computation is slightly expensive).
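In Hadoop terms, this is roughly what I have in mind (a sketch only -
Person, Zips.neighborsOf(), getId(), and distanceTo() are made-up
placeholders for illustration, not our real code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: replicate each person to their own zip code and to every
// neighboring zip code, so one reducer sees all candidate neighbors.
public class ZipReplicationMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        Person p = Person.parse(line.toString());         // placeholder type
        ctx.write(new Text(p.getZip()), line);
        for (String zip : Zips.neighborsOf(p.getZip())) { // placeholder lookup
            ctx.write(new Text(zip), line);
        }
    }
}

// Reducer: hold every person keyed to this zip code in memory and run
// the all-pairs distance computation. The in-memory list is exactly
// the part that stops scaling at ~100,000 items per key.
public class TenClosestReducer
        extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text zip, Iterable<Text> lines, Context ctx)
            throws IOException, InterruptedException {
        List<Person> people = new ArrayList<>();
        for (Text line : lines) {                      // Hadoop reuses the
            people.add(Person.parse(line.toString())); // Text object, so copy
        }
        for (Person a : people) {
            List<Person> byDistance = new ArrayList<>(people);
            // distanceTo() is the (slightly expensive) pairwise step; a
            // real job would keep a bounded heap instead of sorting.
            byDistance.sort(Comparator.comparingDouble(a::distanceTo));
            int emitted = 0;
            for (Person b : byDistance) {
                if (b == a) continue;                  // skip self (distance 0)
                ctx.write(new Text(a.getId()), new Text(b.getId()));
                if (++emitted == 10) break;
            }
        }
    }
}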
   The problem is that this approach will not scale - eventually the number
of people assigned to a zip code will exceed memory. In the current problem
the number of "people" is about 100 million and is doubling every 6 months.
A "zip code" holds about 100,000 items that must be kept in memory -
doable today but marginal in terms of future growth.
   Are there other ways to solve the problem? I considered keeping a random
subset, finding the closest in that subset, and then repeating with
different random subsets. The solution of modifying the splitter to
generate all pairs
(https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java)
will not work for a dataset with 100 million items.
   Any bright ideas?

Re: How do I perform a scalable cartesian product

Posted by Christoph Schmitz <ch...@1und1.de>.
Hi Steve,

if the only problem is that the size of your zipcode squared (more
accurately, n * (n-1) / 2 pairs if you order the persons within each
pair, assuming that distance is symmetric) is too large, it might help
to split the zipcode into buckets by some hash function and partition
the all-pairs computation over the buckets.

That is, if you have 10 buckets containing the users of a zipcode and 
its neighboring zipcodes, first compute all pairs of persons within 
bucket 1, then all pairs between buckets 1 and 2, then 1-3, 1-4, etc., 
up to 10-10. Obviously, you don't need to do 4-1 if you've already 
done 1-4 (symmetry, see above), so with n buckets you'll end up doing 
n * (n+1) / 2 pairs of buckets (55 in this case).

Basically, this means creating artificial, smaller zipcodes.
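
For the mapper side, a rough sketch (the NUM_BUCKETS constant, the
Person type, and the zip:i:j key layout are just assumptions I'm
making for illustration, not tested code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Split each zipcode into NUM_BUCKETS artificial, smaller "zipcodes"
// and route every person to each bucket pair involving their bucket.
public class BucketPairMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private static final int NUM_BUCKETS = 10;

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        Person p = Person.parse(line.toString());   // placeholder type
        int mine = Math.floorMod(p.getId().hashCode(), NUM_BUCKETS);
        for (int other = 0; other < NUM_BUCKETS; other++) {
            int i = Math.min(mine, other);
            int j = Math.max(mine, other);
            // The key "zip:i:j" always has i <= j, so buckets (4,1)
            // and (1,4) land on the same reducer and each pair of
            // buckets is processed exactly once.
            ctx.write(new Text(p.getZip() + ":" + i + ":" + j), line);
        }
    }
}

Each reducer then handles one of the NUM_BUCKETS * (NUM_BUCKETS + 1) / 2
= 55 bucket pairs and holds at most two buckets' worth of people in
memory, at the cost of replicating each person NUM_BUCKETS times. Note
that for a nearest-neighbor query (as opposed to a full cartesian
product) you'd still need a second pass that merges, per person, the
candidates found in the different bucket pairs.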

Apart from that, I'd like to point out that there's been a lot of 
research on nearest-neighbor search; perhaps some state-of-the-art 
algorithm will be applicable to your problem.

http://en.wikipedia.org/wiki/Nearest_neighbor_search

Hope this helps,

Christoph



-- 
Christoph Schmitz
Software Architect
Targeting Core Product

1&1 Internet AG | Brauerstraße 50 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-6733
E-Mail: christoph.schmitz@1und1.de | Web: www.1und1.de
