Posted to common-user@hadoop.apache.org by "W.P. McNeill" <bi...@gmail.com> on 2011/04/11 18:30:55 UTC

Using global reverse lookup tables

I understand that part of the rules of MapReduce is that there is no shared
global information; nevertheless, I have a problem that requires shared
global information, and I'm trying to get a sense of what mechanisms are
available to address it.

I have a bunch of *sets* built on a vocabulary of *elements*.  In the
following, for example:

X {a, b, c}
Y {b, c}
Z {a, c, d}

X, Y, and Z are the sets and a, b, c, and d are the elements. (Maybe the sets
are web pages and the elements are keywords; it doesn't matter.)

Given all the sets, I need to build a reverse lookup table from elements to
sets:

a {X, Z}
b {X, Y}
c {X, Y, Z}
d {Z}

Then, given a set, I need to emit a proper subset of it whose composition is a
function of the values of its elements in the lookup table.  So given Y =
{b, c}, I need to look up b = {X, Y} and c = {X, Y, Z} in order to generate
the output I need.

I have a large number (order of 10^9) of both sets and elements, so I need to
use a scalable system like MapReduce.

Building the lookup table from the sets is easy--that's what MapReduce is
designed to do. However, I'm not sure how to make use of this table in a
subsequent MapReduce process because it is shared global information.
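
For concreteness, here is a minimal sketch of that first job as I picture
it, assuming Text-based records where the input key is a set name and the
value is its space-separated elements ("X" -> "a b c"); all class and field
names are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Emits one (element, set) pair per membership, e.g. (a, X).
  public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
    private final Text element = new Text();

    @Override
    protected void map(Text setId, Text members, Context context)
        throws IOException, InterruptedException {
      for (String m : members.toString().split("\\s+")) {
        element.set(m);
        context.write(element, setId);
      }
    }
  }

  // Collects every set containing an element, e.g. c -> "X Y Z".
  public static class InvertReducer extends Reducer<Text, Text, Text, Text> {
    private final Text sets = new Text();

    @Override
    protected void reduce(Text element, Iterable<Text> setIds, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text id : setIds) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(id);
      }
      sets.set(sb.toString());
      context.write(element, sets);
    }
  }
}

Since reduce output is sorted by key within each partition, writing it with
MapFileOutputFormat would give an indexed, binary-searchable file directly.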

Are there general approaches to solving this problem? I've been trying to
come up with some clever secondary-sort style key arithmetic, but I don't
think that's the answer. Should I just build a BDB database and pack it into
the distributed cache? (I'm reluctant to do that, because the database is
going to be really big. It seems like it should live in a single location
on HDFS.)
Are the higher level tools like Hive or Cascading the right solution here?

Re: Using global reverse lookup tables

Posted by Ted Dunning <td...@maprtech.com>.
On Fri, Apr 15, 2011 at 11:45 AM, W.P. McNeill <bi...@gmail.com> wrote:

> Thanks for your answer. After mulling over this problem for a few days, I
> believe there might be a clearer way for me to phrase the question, so let me
> try that before diving into the specifics of the linear algebra analysis you
> give.
>
> I need to share an inverted index of elements to sets as described above.
> And crucially this index is *immutable*: after it has been created it only
> has to be read from, never written to. So a clearer way to phrase this
> question is: how do I share a large read-only inverted index among multiple
> MapReduce jobs?
>
> I can think of two approaches.
>
>    1. Treat it as a database JOIN on elements between the original
>    table of sets and the inverted index. This is the tack that Ted was
>    suggesting in his response.
>    2. Put the inverted index into a MapFile
>    (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html).
>    The individual jobs load the inverted index at setup() time and do
>    random access reads from it as needed.
>
> A few questions:
>
>    1. Do others agree that these are the two big classes of solution?
>
> Roughly.

>
>    2. Do people have a sense of what the pros and cons of each might be?
>    (BTW quadratic runtime in the density of the set membership rows is probably
>    not a problem; the sets I am dealing with are small relative to the
>    vocabulary size and relatively disjoint.)
>
If the intersection density is very low, then the inverted index approach is
likely to be much better. One helpful tweak would be to put a Bloom filter
in front of the MapFile to avoid probing if you aren't going to get a hit.
Count on about 15-20 bits in the Bloom filter for each element in the
inverted index to get acceptable false positive rates. Can you afford 2-3
bytes of memory use per element in the MapFile?
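
To make the sizing concrete: 16 bits per element is 2 bytes, so 10^9
elements come to roughly 2 GB of filter. A minimal sketch using Hadoop's
own org.apache.hadoop.util.bloom package (parameters are illustrative, and
note that BloomFilter's bit-vector size is an int, so a filter that large
would have to be split into shards):

import java.nio.charset.Charset;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class IndexFilter {
  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final BloomFilter filter;

  public IndexFilter(int expectedKeys) {
    int bitsPerKey = 16;                        // in the 15-20 bit range above
    int vectorSize = expectedKeys * bitsPerKey; // total bits (int-limited)
    int nbHash = 11;                            // ~= bitsPerKey * ln 2
    filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);
  }

  // Call once per key while building the MapFile.
  public void add(String element) {
    filter.add(new Key(element.getBytes(UTF8)));
  }

  // false means "definitely absent", so a miss skips the MapFile probe
  // entirely; a hit still carries a small false-positive chance.
  public boolean mightContain(String element) {
    return filter.membershipTest(new Key(element.getBytes(UTF8)));
  }
}

Hadoop also ships BloomMapFile in org.apache.hadoop.io, which bundles
exactly this pairing: its Reader consults a built-in filter before seeking.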

Another tweak is to sort the input and do the lookups in the reduce.  This
will make the accesses to the MapFiles much faster since the accesses will
be much more sequential.  This can allow significant caching and might make
the Bloom filter a bit of an albatross.
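
A sketch of that sorted variant, assuming the job is keyed by element so
elements reach the reducer in MapFile order (the "index.path" property and
Text-based types are assumptions):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortedLookupReducer extends Reducer<Text, Text, Text, Text> {
  private MapFile.Reader index;
  private final Text sets = new Text();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    index = new MapFile.Reader(FileSystem.get(conf),
                               conf.get("index.path"), conf);
  }

  @Override
  protected void reduce(Text element, Iterable<Text> setIds, Context context)
      throws IOException, InterruptedException {
    // Keys arrive sorted, so each probe is a short forward seek rather
    // than a random jump through the file.
    if (index.get(element, sets) != null) {
      for (Text setId : setIds) {
        context.write(setId, sets);
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    index.close();
  }
}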

Finally, this might even be an interesting job for HBase (storing your
inverted index).

>
>    3. Is Pig or Hive a good tool to use for solution (1)? (I have a
>    feeling the answer might be a 10-line Pig script, but I don't have enough
>    SQL experience to just knock one out.)
>    4. For solution (2), will MapFile scale to a map with 10^9 entries?
>    (Assuming I use the io.map.index.skip property to make the right
>    search-speed/memory tradeoff for my configuration.)

Re: Using global reverse lookup tables

Posted by "W.P. McNeill" <bi...@gmail.com>.
Thanks for your answer. After mulling over this problem for a few days, I
believe there might be a clearer way for me to phrase the question, so let me
try that before diving into the specifics of the linear algebra analysis you
give.

I need to share an inverted index of elements to sets as described above.
And crucially this index is *immutable*: after it has been created it only
has to be read from, never written to. So a clearer way to phrase this
question is: how do I share a large read-only inverted index among multiple
MapReduce jobs?

I can think of two approaches.

   1. Treat it as a database JOIN on elements between the original
   table of sets and the inverted index. This is the tack that Ted was
   suggesting in his response.
   2. Put the inverted index into a MapFile
   (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html).
   The individual jobs load the inverted index at setup() time and do random
   access reads from it as needed (a sketch follows this list).
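
A minimal sketch of what I have in mind for (2), with a hypothetical
"index.path" property naming the MapFile directory on HDFS and Text-based
types throughout:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<Text, Text, Text, Text> {
  private MapFile.Reader index;
  private final Text sets = new Text();

  @Override
  protected void setup(Context context) throws IOException {
    // Every task opens the same read-only index; nothing ever writes to it.
    Configuration conf = context.getConfiguration();
    index = new MapFile.Reader(FileSystem.get(conf),
                               conf.get("index.path"), conf);
  }

  @Override
  protected void map(Text setId, Text element, Context context)
      throws IOException, InterruptedException {
    // Random-access probe; get() returns null when the key is absent.
    if (index.get(element, sets) != null) {
      context.write(setId, sets);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    index.close();
  }
}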

A few questions:

   1. Do others agree that these are the two big classes of solution?
   2. Do people have a sense of what the pros and cons of each might be?
   (BTW quadratic runtime in the density of the set membership rows is probably
   not a problem; the sets I am dealing with are small relative to the
   vocabulary size and relatively disjoint.)
   3. Is Pig or Hive a good tool to use for solution (1)? (I have a feeling
   the answer might be a 10-line Pig script, but I don't have enough SQL
   experience to just knock one out.)
   4. For solution (2), will MapFile scale to a map with 10^9 entries?
   (Assuming I use the io.map.index.skip property to make the right
   search-speed/memory tradeoff for my configuration; rough arithmetic below.)
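
My own back-of-the-envelope for question 4, assuming the default
io.map.index.interval of 128: 10^9 entries come to about 8 million index
entries, and a skip setting like the one below keeps only every 8th of
those in memory, trading scans of up to 1024 records per lookup for an 8x
smaller index footprint.

// Hypothetical driver fragment: keep 1 of every 8 index entries resident.
Configuration conf = new Configuration();
conf.setInt("io.map.index.skip", 7);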

Re: Using global reverse lookup tables

Posted by Ted Dunning <td...@maprtech.com>.
Depending on the function that you want to use, it sounds like you want a
self-join to compute transposed cooccurrence.

That is, it sounds like you want to find all the sets that share elements
with X. If you have a binary matrix A that represents your set membership,
with one row per set and one column per element, then you want to compute
A A'. It is common for A to be available in row-major form or in the form
of pairs. With row-major form, the easiest way to compute your desired
result is to transpose A in a first map-reduce. Either way, once you have
the transposed matrix, a second map-reduce can be used in which the mapper
reads all of the sets with a particular element and emits pairs of sets
that have a common element. The combiner and reducer are basically a pair
counter.
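
A minimal sketch of that second job, assuming one line of the transposed
matrix per element ("c" -> "X Y Z") and illustrative names throughout:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Cooccurrence {

  public static class PairMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(Text element, Text setList, Context context)
        throws IOException, InterruptedException {
      String[] sets = setList.toString().split("\\s+");
      // Quadratic in the row density, hence the downsampling note below.
      for (String a : sets) {
        for (String b : sets) {
          if (!a.equals(b)) {
            pair.set(a + "\t" + b);
            context.write(pair, ONE);
          }
        }
      }
    }
  }

  // Usable as both combiner and reducer: just sums the pair counts.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      total.set(sum);
      context.write(pair, total);
    }
  }
}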

This implementation suffers in that it takes time quadratic in the density
of the most dense row. It is common to downsample such rows to a reasonable
level. Most uses of the cooccurrence matrix don't care about this
downsampling, and it makes the algorithm much faster.

As an alternative, you can compute a matrix decomposition and use that to
compute A A'.  This can be arranged so as to avoid the downsampling, but the
program required is much more complex.  The Apache Mahout project has
several implementations of such decompositions tuned for different
situations.  Some implementations use map-reduce, some are sequential.


On Mon, Apr 11, 2011 at 9:30 AM, W.P. McNeill <bi...@gmail.com> wrote:

> Are there general approaches to solving this problem?