Posted to common-user@hadoop.apache.org by Thibaut_ <tb...@blue.lu> on 2009/02/11 22:39:07 UTC

Finding small subset in very large dataset

Hi,

Let's say the smaller subset is called A. It is a relatively small collection,
< 100 000 entries (could also be only 100), with nearly no payload as value.
Collection B is a big collection with > 10 000 000 entries (each key of A
also exists in collection B), where the value for each key is relatively
big (> 100 KB).

For all the keys in A, I need to get the corresponding value from B and
collect it in the output.


- I can do this by reading in both files and, in the reduce step, doing my
computations and collecting only those keys which appear in both A and B. The
map phase, however, will take very long, as all the key/value pairs of
collection B need to be sorted (and each key's value is > 100 KB) at the end
of the map phase, which is overkill if A is very small.
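(A rough sketch of what that reduce-side join looks like with the old mapred
API; the zero-length marker value for A records and the Text/BytesWritable
types are only illustrative assumptions, not something prescribed by the job:)

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class IntersectReducer extends MapReduceBase
    implements Reducer<Text, BytesWritable, Text, BytesWritable> {

  public void reduce(Text key, Iterator<BytesWritable> values,
                     OutputCollector<Text, BytesWritable> out, Reporter reporter)
      throws IOException {
    boolean inA = false;
    BytesWritable fromB = null;
    while (values.hasNext()) {
      BytesWritable v = values.next();
      if (v.getLength() == 0) {
        inA = true;                               // zero-length marker emitted for A
      } else {
        // copy, because Hadoop reuses the Writable instance between iterations
        fromB = new BytesWritable(Arrays.copyOf(v.getBytes(), v.getLength()));
      }
    }
    if (inA && fromB != null) {
      out.collect(key, fromB);                    // key was present in both A and B
    }
  }
}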

What I would need is an option to somehow compute the intersection first
(a mapper over the keys only, then a reduce function based only on the keys
and not the corresponding values, which collects the keys I want to take),
and then run over the map input again, filtering the output collector or the
input based on the results from that reduce phase.

Or is there another, faster way? Collection A could be so big that it doesn't
fit into memory. I could split collection A up into multiple smaller
collections, but that would make it more complicated, so I want to avoid
that route. (This is similar to the approach I described above, just done
manually.)

Thanks,
Thibaut
-- 
View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Finding small subset in very large dataset

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
if i remember correctly you have two sets of data:

--set A, which is very big
--set B, which is small

and you want to find all elements of A which are in B.  right?

represent A using a variant of a Bloom Filter which supports key-value
pairs.  A Bloomier Filter will do this for you.

each mapper then loads up A (represented using the Bloomier Filter)
and works over B. Whenever a key of B is present in the representation of A,
you look up the associated value and emit it.

if, even using a Bloomier Filter, you still need too much memory, then
you could store it once using Hypertable.
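(As a concrete illustration of the mapper-side filtering idea - a plain
membership Bloom filter rather than the key-value Bloomier variant, and
assuming a Hadoop release that ships org.apache.hadoop.util.bloom; a
hand-rolled bit vector works the same way. The "small.keys.filter" property
and the Text/BytesWritable types are made up for the sketch:)

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BloomFilterMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  private final BloomFilter filter = new BloomFilter();   // filled in configure()

  public void configure(JobConf job) {
    try {
      // "small.keys.filter" names a filter that an earlier, cheap job
      // serialised over the small key set with BloomFilter.write()
      FSDataInputStream in =
          FileSystem.get(job).open(new Path(job.get("small.keys.filter")));
      filter.readFields(in);
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load the key filter", e);
    }
  }

  public void map(Text key, BytesWritable value,
                  OutputCollector<Text, BytesWritable> out, Reporter reporter)
      throws IOException {
    byte[] k = Arrays.copyOf(key.getBytes(), key.getLength());
    // a hit may be a false positive, so survivors still need an exact check later;
    // a miss is definite, and the >100 KB value is never sorted or shuffled
    if (filter.membershipTest(new Key(k))) {
      out.collect(key, value);
    }
  }
}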

see here for an explanation of Bloomier Filters applied to the task of
storing lots of (string, probability) pairs:

Randomized Language Models via Perfect Hash Functions
http://aclweb.org/anthology-new/P/P08/P08-1058.pdf
Miles

2009/2/18 Thibaut_ <tb...@blue.lu>:
>
> Hi Miles,
>
> I'm not following you.
> If I'm saving an associated hash or bit vector, how can I then quickly
> access the elements afterwards (the file with the data might be >100GB big
> and is on the DFS)?
>
> I could also directly save the offset of the data in the datafile as
> reference, and then on each reducer read in that big file only once. As all
> the keys are sorted, I can get all the needed values in one big read step
> (skipping those entries I don't need).
>
>
> Thibaut
>
>
>
> Miles Osborne wrote:
>>
>> just re-represent the associated data as a bit vector and set of hash
>> functions.  you then just copy this around, rather than the raw items
>> themselves.
>>
>> Miles
>>
>> 2009/2/18 Thibaut_ <tb...@blue.lu>:
>>>
>>> Hi,
>>>
>>> The bloomfilter solution works great, but I still have to copy the data
>>> around sometimes.
>>>
>>> I'm still wondering if I can reduce the associated data to the keys to a
>>> reference or something small (the >100 KB of data are very big), with
>>> which
>>> I can then later fetch the data in the reduce step.
>>>
>>> In the past I was using hbase to store the associated data in it (but
>>> unfortunately hbase proved to be very unreliable in my case). I will
>>> probably also start to compress the data in the value store, which will
>>> probably increase sorting speed (as the data is there probably
>>> uncompressed).
>>> Is there something else I could do to speed this process up?
>>>
>>> Thanks,
>>> Thibaut
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
> --
> View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: Finding small subset in very large dataset

Posted by Thibaut_ <tb...@blue.lu>.
Hi Miles,

I'm not following you.
If I'm saving an associated hash or bit vector, how can I then quickly
access the elements afterwards (the file with the data might be > 100 GB
and is on the DFS)?

I could also directly save the offset of the data in the data file as a
reference, and then have each reducer read that big file only once. As all
the keys are sorted, I can get all the needed values in one big sequential
read (skipping the entries I don't need).
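
(A rough sketch of that offset idea, assuming the big collection is an
uncompressed or record-compressed SequenceFile; the paths and the
Text/BytesWritable types are illustrative:)

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class OffsetLookup {

  // One cheap pass over the big file, remembering where each wanted key starts.
  public static Map<String, Long> indexOffsets(Configuration conf, Path bigFile,
                                               Set<String> wantedKeys) throws IOException {
    Map<String, Long> offsets = new HashMap<String, Long>();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), bigFile, conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    long pos = reader.getPosition();              // offset of the record about to be read
    while (reader.next(key, value)) {
      if (wantedKeys.contains(key.toString())) {
        offsets.put(key.toString(), pos);
      }
      pos = reader.getPosition();
    }
    reader.close();
    return offsets;
  }

  // Later, jump straight to a recorded offset instead of re-reading the whole file.
  public static byte[] valueAt(Configuration conf, Path bigFile, long offset)
      throws IOException {
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), bigFile, conf);
    reader.seek(offset);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    reader.next(key, value);
    reader.close();
    return Arrays.copyOf(value.getBytes(), value.getLength());
  }
}

Sorting the wanted offsets first, as described above, turns each reducer's
accesses into a single forward sweep over the file.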


Thibaut



Miles Osborne wrote:
> 
> just re-represent the associated data as a bit vector and set of hash
> functions.  you then just copy this around, rather than the raw items
> themselves.
> 
> Miles
> 
> 2009/2/18 Thibaut_ <tb...@blue.lu>:
>>
>> Hi,
>>
>> The bloomfilter solution works great, but I still have to copy the data
>> around sometimes.
>>
>> I'm still wondering if I can reduce the associated data to the keys to a
>> reference or something small (the >100 KB of data are very big), with
>> which
>> I can then later fetch the data in the reduce step.
>>
>> In the past I was using hbase to store the associated data in it (but
>> unfortunately hbase proved to be very unreliable in my case). I will
>> probably also start to compress the data in the value store, which will
>> probably increase sorting speed (as the data is there probably
>> uncompressed).
>> Is there something else I could do to speed this process up?
>>
>> Thanks,
>> Thibaut
>> --
>> View this message in context:
>> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 

-- 
View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Finding small subset in very large dataset

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
just re-represent the associated data as a bit vector and set of hash
functions.  you then just copy this around, rather than the raw items
themselves.

Miles

2009/2/18 Thibaut_ <tb...@blue.lu>:
>
> Hi,
>
> The bloomfilter solution works great, but I still have to copy the data
> around sometimes.
>
> I'm still wondering if I can reduce the associated data to the keys to a
> reference or something small (the >100 KB of data are very big), with which
> I can then later fetch the data in the reduce step.
>
> In the past I was using hbase to store the associated data in it (but
> unfortunately hbase proved to be very unreliable in my case). I will
> probably also start to compress the data in the value store, which will
> probably increase sorting speed (as the data is there probably
> uncompressed).
> Is there something else I could do to speed this process up?
>
> Thanks,
> Thibaut
> --
> View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: Finding small subset in very large dataset

Posted by Thibaut_ <tb...@blue.lu>.
Hi,

The bloomfilter solution works great, but I still have to copy the data
around sometimes.

I'm still wondering if I can reduce the data associated with the keys to a
reference or something small (the > 100 KB values are very big), with which
I can then fetch the data later, in the reduce step.

In the past I was using HBase to store the associated data (but
unfortunately HBase proved to be very unreliable in my case). I will
probably also start compressing the data in the value store, which should
increase sorting speed (as the data there is probably stored
uncompressed).
Is there something else I could do to speed this process up?
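
(A minimal sketch of switching the value store to a block-compressed
SequenceFile output, plus compressing the intermediate map output that gets
sorted; the DefaultCodec choice is just an assumption, any installed codec
would do:)

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CompressedValueStore {
  public static void configure(JobConf job) {
    job.setOutputFormat(SequenceFileOutputFormat.class);
    // compress whole blocks of records; this usually helps most for large values
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    // also compress the intermediate map output, which is what actually gets sorted
    job.setCompressMapOutput(true);
  }
}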

Thanks,
Thibaut
-- 
View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Finding small subset in very large dataset

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
Bloom Filters are one of the greatest things ever, so it is nice to
see another application.

Remember that your filter may make mistakes: you will see items that
are not in the set.  Also, instead of setting a single bit per item
(in the A set), set k distinct bits.

You can analytically work out the best k for a given number of items
and for some amount of memory.  In practice, this usually boils down
to k being 3 or so for a reasonable error rate.
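
(For anyone who wants the arithmetic: with n keys and an m-bit vector the
optimum is k = (m/n) * ln 2, and the false-positive rate is roughly
(1 - e^(-kn/m))^k. A tiny throwaway calculation, with made-up numbers:)

public class BloomSizing {
  public static void main(String[] args) {
    long n = 100000;               // keys in the small set
    long m = 500000;               // bits spent on the filter (~61 KB)
    int k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    double falsePositiveRate = Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    // prints k = 3 and an error rate of roughly 9% for these numbers
    System.out.printf("k = %d, false positive rate = %.3f%n", k, falsePositiveRate);
  }
}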

Happy hunting

Miles

2009/2/12 Thibaut_ <tb...@blue.lu>:
>
> Thanks,
>
> I didn't think about the bloom filter variant. That's the solution I was
> looking for :-)
>
> Thibaut
> --
> View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: Finding small subset in very large dataset

Posted by Thibaut_ <tb...@blue.lu>.
Thanks,

I didn't think about the bloom filter variant. That's the solution I was
looking for :-)

Thibaut
-- 
View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Finding small subset in very large dataset

Posted by Aaron Kimball <aa...@cloudera.com>.
I don't see why a HAR archive needs to be involved. You can use a MapFile to
create a scannable index over a SequenceFile and do lookups that way.
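
(A sketch of what such a lookup looks like, assuming B has been rewritten, in
sorted key order, as a MapFile; the path and types are placeholders:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // "/data/B.map" must have been written by MapFile.Writer in sorted key order
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/B.map", conf);
    Text key = new Text("some-key-from-A");
    BytesWritable value = new BytesWritable();
    if (reader.get(key, value) != null) {          // null means the key is absent
      System.out.println(key + " -> " + value.getLength() + " bytes of payload");
    }
    reader.close();
  }
}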

But if A is small enough to fit in RAM, then there is a much simpler way:
write it out to a file and disseminate it to all mappers via the
DistributedCache. They then each read the entire A set into a HashSet or
other data structure during configure(), before they scan through their
slices of B. They then emit only the B values which hit in A. This is called
a "map-side join." If you don't care about sorted ordering of your results,
you can then disable the reducers entirely.
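
(A minimal sketch of that map-side join, assuming the A keys sit one per line
in a text file that was added to the DistributedCache when the job was
submitted; the file layout and the Text/BytesWritable types are assumptions:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSideJoinMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  private final Set<String> aKeys = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        aKeys.add(line.trim());                   // one A key per line
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load A keys from the DistributedCache", e);
    }
  }

  public void map(Text key, BytesWritable value,
                  OutputCollector<Text, BytesWritable> out, Reporter reporter)
      throws IOException {
    if (aKeys.contains(key.toString())) {         // exact membership, no false positives
      out.collect(key, value);
    }
  }
}

On the submitting side this pairs with DistributedCache.addCacheFile(aKeysUri,
job) and job.setNumReduceTasks(0), which is the "disable the reducers
entirely" part.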

Hive already supports this behavior; but you have to explicitly tell it to
enable map-side joins for each query because only you know that one data set
is "small enough" ahead of time.

If your A set doesn't fit in RAM, you'll need to get more creative. One
possibility is to do the same thing as above, but instead of reading all of
A into memory, use a hash function to squash the keys from A into some
bounded amount of RAM. For example, allocate yourself a 256 MB bitvector;
for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for
each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1,
then it may match an A key; if it's 0 then it definitely does not match an A
key. For each potential match, send it to the reducer. Send all the A keys
to the reducer as well, where the precise joining will occur. (Note: this
is effectively the same thing as a Bloom Filter.)
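
(A sketch of that bit-vector filter in isolation - a one-hash Bloom filter,
really. The String keys are an assumption, and note that java.util.BitSet
tops out just under 2^31 bits, i.e. about 256 MB:)

import java.util.BitSet;

public class KeyBitVector {
  private final BitSet bits;
  private final int size;

  public KeyBitVector(int sizeInBits) {
    this.size = sizeInBits;
    this.bits = new BitSet(sizeInBits);
  }

  private int slot(String key) {
    // mask the sign bit rather than use Math.abs, which overflows on Integer.MIN_VALUE
    return (key.hashCode() & 0x7fffffff) % size;
  }

  public void add(String aKey) {
    bits.set(slot(aKey));
  }

  // true means "maybe in A" (possible false positive); false means "definitely not in A"
  public boolean mightContain(String bKey) {
    return bits.get(slot(bKey));
  }
}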

This will send much less data to each reducer and should see better
throughput.

- Aaron


On Wed, Feb 11, 2009 at 4:07 PM, Amit Chandel <am...@gmail.com> wrote:

> Are the keys in collection B unique?
>
> If so, I would like to try this approach:
> For each <key, value> of collection B, make a file out of it with file name
> given by MD5 hash of the "key", and "value" being its content, and then
> store all these files into a HAR archive.
> The HAR archive will create an index for you over the keys.
> Now you can iterate over the collection A, get the MD5 hash of the key, and
> look up in the archive for the file (to get the value).
>
> On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ <tb...@blue.lu> wrote:
>
> >
> > Hi,
> >
> > Let's say the smaller subset has name A. It is a relatively small
> > collection
> > < 100 000 entries (could also be only 100), with nearly no payload as
> > value.
> > Collection B is a big collection with >10 000 000 entries (Each key of A
> > also exists in the collection B), where the value for each key is
> > relatively
> > big (> 100 KB)
> >
> > For all the keys in A, I need to get the corresponding value from B and
> > collect it in the output.
> >
> >
> > - I can do this by reading in both files, and on the reduce step, do my
> > computations and collect only those which are both in A and B. The map
> > phase
> > however will take very long as all the key/value pairs of collection B
> need
> > to be sorted (and each key's value is >100 KB) at the end of the map
> phase,
> > which is overkill if A is very small.
> >
> > What I would need is an option to somehow make the intersection first
> > (Mapper only on keys, then a reduce function based only on keys and not
> the
> > corresponding values which collects the keys I want to take), and then
> > running the map input and filtering the output collector or the input
> based
> > on the results from the reduce phase.
> >
> > Or is there another faster way? Collection A could be so big that it
> > doesn't
> > fit into the memory. I could split collection A up into multiple smaller
> > collections, but that would make it more complicated, so I want to evade
> > that route. (This is similar to the approach I described above, just a
> > manual approach)
> >
> > Thanks,
> > Thibaut
> > --
> > View this message in context:
> >
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>

Re: Finding small subset in very large dataset

Posted by Amit Chandel <am...@gmail.com>.
Are the keys in collection B unique?

If so, I would like to try this approach:
For each <key, value> of collection B, make a file out of it with file name
given by MD5 hash of the "key", and "value" being its content, and then
store all these files into a HAR archive.
The HAR archive will create an index for you over the keys.
Now you can iterate over the collection A, get the MD5 hash of the key, and
look up the corresponding file in the archive (to get the value).
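
(For the naming step, a small sketch of deriving a fixed-length,
filesystem-safe file name from the MD5 of a key; the archive itself would
still be built with the usual hadoop archive tooling:)

import java.security.MessageDigest;

public class Md5Name {
  // returns 32 lowercase hex characters, stable for a given key
  public static String nameFor(String key) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}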

On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ <tb...@blue.lu> wrote:

>
> Hi,
>
> Let's say the smaller subset has name A. It is a relatively small
> collection
> < 100 000 entries (could also be only 100), with nearly no payload as
> value.
> Collection B is a big collection with >10 000 000 entries (Each key of A
> also exists in the collection B), where the value for each key is
> relatively
> big (> 100 KB)
>
> For all the keys in A, I need to get the corresponding value from B and
> collect it in the output.
>
>
> - I can do this by reading in both files, and on the reduce step, do my
> computations and collect only those which are both in A and B. The map
> phase
> however will take very long as all the key/value pairs of collection B need
> to be sorted (and each key's value is >100 KB) at the end of the map phase,
> which is overkill if A is very small.
>
> What I would need is an option to somehow make the intersection first
> (Mapper only on keys, then a reduce function based only on keys and not the
> corresponding values which collects the keys I want to take), and then
> running the map input and filtering the output collector or the input based
> on the results from the reduce phase.
>
> Or is there another faster way? Collection A could be so big that it
> doesn't
> fit into the memory. I could split collection A up into multiple smaller
> collections, but that would make it more complicated, so I want to evade
> that route. (This is similar to the approach I described above, just a
> manual approach)
>
> Thanks,
> Thibaut
> --
> View this message in context:
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>