You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mahout.apache.org by Boris Aleksandrovsky <ba...@gmail.com> on 2011/03/09 17:02:49 UTC

Bloomier filters

Does anyone know of Java implementation of Bloomier filters (essentially
Bloom map, see http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)? I would
like to use it to efficients store language models (ngram to count
association map). It is probably not that hard to implement, but I was
wondering if there is anything out there?

Thanks,
Boris

Re: Bloomier filters

Posted by Ted Dunning <te...@gmail.com>.
Usually what I do when faced with this kind of structure is to

a) prune the data

b) reduce it to an int->int map or a long->int map using some kind of hash
on the original key

c) store the map in a hybrid data structure consisting of an ordered list of
keys and corresponding values plus a more general
hash map.  Searching the ordered list can use accelerated binary search
because we know the keys are uniformly distributed.
Insertions go against the hash map until we get a bunch of data at which
point, I sort the hashed values and merge into the
big list.

This is essentially equivalent to the way that hbase stores data but in
memory.  Using these techniques, I can get a very low
error rate, very simple code, high speed and very small memory foot-print.
 A few GB of memory can store nearly a billion
entries.

On Wed, Mar 9, 2011 at 8:02 AM, Boris Aleksandrovsky <ba...@gmail.com>wrote:

> Does anyone know of Java implementation of Bloomier filters (essentially
> Bloom map, see http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)? I
> would
> like to use it to efficients store language models (ngram to count
> association map). It is probably not that hard to implement, but I was
> wondering if there is anything out there?
>
> Thanks,
> Boris
>

Re: Bloomier filters

Posted by Ted Dunning <te...@gmail.com>.
Start with Ken's and measure the impact.  If you don't need the next level
of optimization, then you are done.  Last time I did it, it took me several
hours to get the hybrid approach to work just right.  You might as well
avoid that time.

On Wed, Mar 9, 2011 at 9:55 AM, Boris Aleksandrovsky <ba...@gmail.com>wrote:

> Thanks, Ted and Ken. I think I will try Ted's algorithm (with Ken's
> suggestion for hash table implementation).
>
>
> On Wed, Mar 9, 2011 at 9:40 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> This is a good approach.  You can gain almost a factor of two if you used
>> the hybrid technique I mentioned because the ordered table
>> requires no extra storage.  If you want to go crazy, you can use delta
>> encoding, compressed integers and skip-lists to get another factor of
>> two to four.
>>
>> My preferred method, however, is to use my credit card and get more
>> memory.
>>
>> On Wed, Mar 9, 2011 at 9:32 AM, Ken Krugler <kkrugler_lists@transpac.com
>> >wrote:
>>
>> >
>> > On Mar 9, 2011, at 8:02am, Boris Aleksandrovsky wrote:
>> >
>> >  Does anyone know of Java implementation of Bloomier filters
>> (essentially
>> >> Bloom map, see http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)? I
>> >> would
>> >> like to use it to efficients store language models (ngram to count
>> >> association map). It is probably not that hard to implement, but I was
>> >> wondering if there is anything out there?
>> >>
>> >
>> > We often generate a 64-bit JOAAT hash from the string, then use the
>> native
>> > long->int hashmap support in fastutil (http://fastutil.dsi.unimi.it/)
>> >
>> > -- Ken
>> >
>> > --------------------------
>> > Ken Krugler
>> > +1 530-210-6378
>> > http://bixolabs.com
>> > e l a s t i c   w e b   m i n i n g
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>

Re: Bloomier filters

Posted by Boris Aleksandrovsky <ba...@gmail.com>.
Thanks, Ted and Ken. I think I will try Ted's algorithm (with Ken's
suggestion for hash table implementation).

On Wed, Mar 9, 2011 at 9:40 AM, Ted Dunning <te...@gmail.com> wrote:

> This is a good approach.  You can gain almost a factor of two if you used
> the hybrid technique I mentioned because the ordered table
> requires no extra storage.  If you want to go crazy, you can use delta
> encoding, compressed integers and skip-lists to get another factor of
> two to four.
>
> My preferred method, however, is to use my credit card and get more memory.
>
> On Wed, Mar 9, 2011 at 9:32 AM, Ken Krugler <kkrugler_lists@transpac.com
> >wrote:
>
> >
> > On Mar 9, 2011, at 8:02am, Boris Aleksandrovsky wrote:
> >
> >  Does anyone know of Java implementation of Bloomier filters (essentially
> >> Bloom map, see http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)? I
> >> would
> >> like to use it to efficients store language models (ngram to count
> >> association map). It is probably not that hard to implement, but I was
> >> wondering if there is anything out there?
> >>
> >
> > We often generate a 64-bit JOAAT hash from the string, then use the
> native
> > long->int hashmap support in fastutil (http://fastutil.dsi.unimi.it/)
> >
> > -- Ken
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c   w e b   m i n i n g
> >
> >
> >
> >
> >
> >
>

Re: Bloomier filters

Posted by Ted Dunning <te...@gmail.com>.
This is a good approach.  You can gain almost a factor of two if you used
the hybrid technique I mentioned because the ordered table
requires no extra storage.  If you want to go crazy, you can use delta
encoding, compressed integers and skip-lists to get another factor of
two to four.

My preferred method, however, is to use my credit card and get more memory.

On Wed, Mar 9, 2011 at 9:32 AM, Ken Krugler <kk...@transpac.com>wrote:

>
> On Mar 9, 2011, at 8:02am, Boris Aleksandrovsky wrote:
>
>  Does anyone know of Java implementation of Bloomier filters (essentially
>> Bloom map, see http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)? I
>> would
>> like to use it to efficients store language models (ngram to count
>> association map). It is probably not that hard to implement, but I was
>> wondering if there is anything out there?
>>
>
> We often generate a 64-bit JOAAT hash from the string, then use the native
> long->int hashmap support in fastutil (http://fastutil.dsi.unimi.it/)
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>

Re: Bloomier filters

Posted by Ken Krugler <kk...@transpac.com>.
On Mar 9, 2011, at 8:02am, Boris Aleksandrovsky wrote:

> Does anyone know of Java implementation of Bloomier filters  
> (essentially
> Bloom map, see http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)?  
> I would
> like to use it to efficients store language models (ngram to count
> association map). It is probably not that hard to implement, but I was
> wondering if there is anything out there?

We often generate a 64-bit JOAAT hash from the string, then use the  
native long->int hashmap support in fastutil (http://fastutil.dsi.unimi.it/ 
)

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g