You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@couchdb.apache.org by Norman Barker <no...@gmail.com> on 2010/09/25 01:54:49 UTC

multiview using bloom filters

Hi,

thanks to Paul's excellent suggestion I have rewritten the multiview
to use bloom filters, I had a concern that a bloom filter per view
would use too much memory but thanks in the main to excellent
implementation of bloom filters in erlang
(http://sites.google.com/site/scalablebloomfilters/) they seem to be
very space efficient.

New code is here

http://github.com/normanb/couchdb/

The code is simple, all one process, once we have agreed the approach
we can decide if there is any benefit in making the bloom filter
generation occur a separate process (using a genserver).

Comments as always appreciated, I will continue adding to the test suite.

thanks for the help,

Norman

Re: multiview using bloom filters

Posted by Paul Davis <pa...@gmail.com>.
I would say if you find it somewhere with an EPL header, then that's a
good sign. But best to check with the original author that it was his
intent. As an interesting aside, I'm not even sure if its possible
that someone that's not an employee of Ericson can release something
under the EPL.

I'm not being a pain on purpose here, but the ASF is particular about
provenance in code and licensing. Alternatively, I haven't studied the
code, so in the worst case I'll read the white paper and try and apply
it to Basho's ebloom project.

Paul Davis

On Fri, Sep 24, 2010 at 11:21 PM, Norman Barker <no...@gmail.com> wrote:
> Paul,
>
> yes, performance is actually much better (for some of our harder
> queries, so all docs over time with field X (two views), 10x faster),
> I am testing with docs that in total emit ~100K of keys (following the
> raindrop megaview).
>
> Some of the scalable bloom filter project contained EPL headers,
> others didn't, googling for the source code I had seen other projects
> add the EPL headers to bit array so I did the same. I will contact the
> author as he seems active on the erlang mailing lists and if not I
> will write a bloom filter from scratch, the theory is well documented,
> though I like his code!
>
> thanks for your help, let me know any suggestions you may have.
>
> thanks,
>
> Norman
>
>
>
> On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis <pa...@gmail.com> wrote:
>> Norman,
>>
>> Just glanced through. Looks better. Any feeling for a performance differences?
>>
>> Also, I glanced at the original files that you linked to. The bit
>> array files didn't have a license, but what you've got there does have
>> EPL headers. We need to make sure we have permission to do so. I would
>> assume as much, but we have to be careful about such things in the
>> ASF. You only need to get an email from the original author saying its
>> ok.
>>
>> I'm a bit caught up with some other code at the moment, I'll give a
>> more thorough combing over tomorrow.
>>
>> Paul
>>
>> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker <no...@gmail.com> wrote:
>>> Hi,
>>>
>>> thanks to Paul's excellent suggestion I have rewritten the multiview
>>> to use bloom filters, I had a concern that a bloom filter per view
>>> would use too much memory but thanks in the main to excellent
>>> implementation of bloom filters in erlang
>>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
>>> very space efficient.
>>>
>>> New code is here
>>>
>>> http://github.com/normanb/couchdb/
>>>
>>> The code is simple, all one process, once we have agreed the approach
>>> we can decide if there is any benefit in making the bloom filter
>>> generation occur a separate process (using a genserver).
>>>
>>> Comments as always appreciated, I will continue adding to the test suite.
>>>
>>> thanks for the help,
>>>
>>> Norman
>>>
>>
>

Re: multiview using bloom filters

Posted by Norman Barker <no...@gmail.com>.
I have added the formatting changes and contacted the author of
scalable bloom filters, it seems that bitarray (and the hipe version)
came from a discussion on the erlang mailing lists

http://groups.google.com/group/erlang-programming/browse_thread/thread/7c0191b1d709a5fe/ea5cf52b46d67d76?lnk=gst&q=bitarray#ea5cf52b46d67d76

but the author of the bloom.erl should be able to confirm.

Any other comments, anyone had a chance to test it out?!!

thanks,

Norman


On Sat, Sep 25, 2010 at 8:45 AM, Paul Davis <pa...@gmail.com> wrote:
> Although, I don't believe theirs is growable. But if it is, that might
> be interesting to test for speed. Or we could add the growable parts.
>
> On Sat, Sep 25, 2010 at 5:44 AM, Robert Dionne
> <di...@dionne-associates.com> wrote:
>> Norman,
>>
>>   Basho also has a bloom filter implementation packaged as a separate project[1], that you might find useful. It's used in Bitcask.
>>
>> Cheers,
>>
>> Bob
>>
>>
>>
>> [1] http://github.com/basho/ebloom
>>
>>
>>
>>
>> On Sep 24, 2010, at 11:21 PM, Norman Barker wrote:
>>
>>> Paul,
>>>
>>> yes, performance is actually much better (for some of our harder
>>> queries, so all docs over time with field X (two views), 10x faster),
>>> I am testing with docs that in total emit ~100K of keys (following the
>>> raindrop megaview).
>>>
>>> Some of the scalable bloom filter project contained EPL headers,
>>> others didn't, googling for the source code I had seen other projects
>>> add the EPL headers to bit array so I did the same. I will contact the
>>> author as he seems active on the erlang mailing lists and if not I
>>> will write a bloom filter from scratch, the theory is well documented,
>>> though I like his code!
>>>
>>> thanks for your help, let me know any suggestions you may have.
>>>
>>> thanks,
>>>
>>> Norman
>>>
>>>
>>>
>>> On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis <pa...@gmail.com> wrote:
>>>> Norman,
>>>>
>>>> Just glanced through. Looks better. Any feeling for a performance differences?
>>>>
>>>> Also, I glanced at the original files that you linked to. The bit
>>>> array files didn't have a license, but what you've got there does have
>>>> EPL headers. We need to make sure we have permission to do so. I would
>>>> assume as much, but we have to be careful about such things in the
>>>> ASF. You only need to get an email from the original author saying its
>>>> ok.
>>>>
>>>> I'm a bit caught up with some other code at the moment, I'll give a
>>>> more thorough combing over tomorrow.
>>>>
>>>> Paul
>>>>
>>>> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker <no...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> thanks to Paul's excellent suggestion I have rewritten the multiview
>>>>> to use bloom filters, I had a concern that a bloom filter per view
>>>>> would use too much memory but thanks in the main to excellent
>>>>> implementation of bloom filters in erlang
>>>>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
>>>>> very space efficient.
>>>>>
>>>>> New code is here
>>>>>
>>>>> http://github.com/normanb/couchdb/
>>>>>
>>>>> The code is simple, all one process, once we have agreed the approach
>>>>> we can decide if there is any benefit in making the bloom filter
>>>>> generation occur a separate process (using a genserver).
>>>>>
>>>>> Comments as always appreciated, I will continue adding to the test suite.
>>>>>
>>>>> thanks for the help,
>>>>>
>>>>> Norman
>>>>>
>>>>
>>
>>
>

Re: multiview using bloom filters

Posted by Paul Davis <pa...@gmail.com>.
Although, I don't believe theirs is growable. But if it is, that might
be interesting to test for speed. Or we could add the growable parts.

On Sat, Sep 25, 2010 at 5:44 AM, Robert Dionne
<di...@dionne-associates.com> wrote:
> Norman,
>
>   Basho also has a bloom filter implementation packaged as a separate project[1], that you might find useful. It's used in Bitcask.
>
> Cheers,
>
> Bob
>
>
>
> [1] http://github.com/basho/ebloom
>
>
>
>
> On Sep 24, 2010, at 11:21 PM, Norman Barker wrote:
>
>> Paul,
>>
>> yes, performance is actually much better (for some of our harder
>> queries, so all docs over time with field X (two views), 10x faster),
>> I am testing with docs that in total emit ~100K of keys (following the
>> raindrop megaview).
>>
>> Some of the scalable bloom filter project contained EPL headers,
>> others didn't, googling for the source code I had seen other projects
>> add the EPL headers to bit array so I did the same. I will contact the
>> author as he seems active on the erlang mailing lists and if not I
>> will write a bloom filter from scratch, the theory is well documented,
>> though I like his code!
>>
>> thanks for your help, let me know any suggestions you may have.
>>
>> thanks,
>>
>> Norman
>>
>>
>>
>> On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis <pa...@gmail.com> wrote:
>>> Norman,
>>>
>>> Just glanced through. Looks better. Any feeling for a performance differences?
>>>
>>> Also, I glanced at the original files that you linked to. The bit
>>> array files didn't have a license, but what you've got there does have
>>> EPL headers. We need to make sure we have permission to do so. I would
>>> assume as much, but we have to be careful about such things in the
>>> ASF. You only need to get an email from the original author saying its
>>> ok.
>>>
>>> I'm a bit caught up with some other code at the moment, I'll give a
>>> more thorough combing over tomorrow.
>>>
>>> Paul
>>>
>>> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker <no...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> thanks to Paul's excellent suggestion I have rewritten the multiview
>>>> to use bloom filters, I had a concern that a bloom filter per view
>>>> would use too much memory but thanks in the main to excellent
>>>> implementation of bloom filters in erlang
>>>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
>>>> very space efficient.
>>>>
>>>> New code is here
>>>>
>>>> http://github.com/normanb/couchdb/
>>>>
>>>> The code is simple, all one process, once we have agreed the approach
>>>> we can decide if there is any benefit in making the bloom filter
>>>> generation occur a separate process (using a genserver).
>>>>
>>>> Comments as always appreciated, I will continue adding to the test suite.
>>>>
>>>> thanks for the help,
>>>>
>>>> Norman
>>>>
>>>
>
>

Re: multiview using bloom filters

Posted by Robert Dionne <di...@dionne-associates.com>.
Norman,

   Basho also has a bloom filter implementation packaged as a separate project[1], that you might find useful. It's used in Bitcask.

Cheers,

Bob



[1] http://github.com/basho/ebloom




On Sep 24, 2010, at 11:21 PM, Norman Barker wrote:

> Paul,
> 
> yes, performance is actually much better (for some of our harder
> queries, so all docs over time with field X (two views), 10x faster),
> I am testing with docs that in total emit ~100K of keys (following the
> raindrop megaview).
> 
> Some of the scalable bloom filter project contained EPL headers,
> others didn't, googling for the source code I had seen other projects
> add the EPL headers to bit array so I did the same. I will contact the
> author as he seems active on the erlang mailing lists and if not I
> will write a bloom filter from scratch, the theory is well documented,
> though I like his code!
> 
> thanks for your help, let me know any suggestions you may have.
> 
> thanks,
> 
> Norman
> 
> 
> 
> On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis <pa...@gmail.com> wrote:
>> Norman,
>> 
>> Just glanced through. Looks better. Any feeling for a performance differences?
>> 
>> Also, I glanced at the original files that you linked to. The bit
>> array files didn't have a license, but what you've got there does have
>> EPL headers. We need to make sure we have permission to do so. I would
>> assume as much, but we have to be careful about such things in the
>> ASF. You only need to get an email from the original author saying its
>> ok.
>> 
>> I'm a bit caught up with some other code at the moment, I'll give a
>> more thorough combing over tomorrow.
>> 
>> Paul
>> 
>> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker <no...@gmail.com> wrote:
>>> Hi,
>>> 
>>> thanks to Paul's excellent suggestion I have rewritten the multiview
>>> to use bloom filters, I had a concern that a bloom filter per view
>>> would use too much memory but thanks in the main to excellent
>>> implementation of bloom filters in erlang
>>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
>>> very space efficient.
>>> 
>>> New code is here
>>> 
>>> http://github.com/normanb/couchdb/
>>> 
>>> The code is simple, all one process, once we have agreed the approach
>>> we can decide if there is any benefit in making the bloom filter
>>> generation occur a separate process (using a genserver).
>>> 
>>> Comments as always appreciated, I will continue adding to the test suite.
>>> 
>>> thanks for the help,
>>> 
>>> Norman
>>> 
>> 


Re: multiview using bloom filters

Posted by Norman Barker <no...@gmail.com>.
Paul,

yes, performance is actually much better (for some of our harder
queries, so all docs over time with field X (two views), 10x faster),
I am testing with docs that in total emit ~100K of keys (following the
raindrop megaview).

Some of the scalable bloom filter project contained EPL headers,
others didn't, googling for the source code I had seen other projects
add the EPL headers to bit array so I did the same. I will contact the
author as he seems active on the erlang mailing lists and if not I
will write a bloom filter from scratch, the theory is well documented,
though I like his code!

thanks for your help, let me know any suggestions you may have.

thanks,

Norman



On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis <pa...@gmail.com> wrote:
> Norman,
>
> Just glanced through. Looks better. Any feeling for a performance differences?
>
> Also, I glanced at the original files that you linked to. The bit
> array files didn't have a license, but what you've got there does have
> EPL headers. We need to make sure we have permission to do so. I would
> assume as much, but we have to be careful about such things in the
> ASF. You only need to get an email from the original author saying its
> ok.
>
> I'm a bit caught up with some other code at the moment, I'll give a
> more thorough combing over tomorrow.
>
> Paul
>
> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker <no...@gmail.com> wrote:
>> Hi,
>>
>> thanks to Paul's excellent suggestion I have rewritten the multiview
>> to use bloom filters, I had a concern that a bloom filter per view
>> would use too much memory but thanks in the main to excellent
>> implementation of bloom filters in erlang
>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
>> very space efficient.
>>
>> New code is here
>>
>> http://github.com/normanb/couchdb/
>>
>> The code is simple, all one process, once we have agreed the approach
>> we can decide if there is any benefit in making the bloom filter
>> generation occur a separate process (using a genserver).
>>
>> Comments as always appreciated, I will continue adding to the test suite.
>>
>> thanks for the help,
>>
>> Norman
>>
>

Re: multiview using bloom filters

Posted by Paul Davis <pa...@gmail.com>.
Norman,

Just glanced through. Looks better. Any feeling for a performance differences?

Also, I glanced at the original files that you linked to. The bit
array files didn't have a license, but what you've got there does have
EPL headers. We need to make sure we have permission to do so. I would
assume as much, but we have to be careful about such things in the
ASF. You only need to get an email from the original author saying its
ok.

I'm a bit caught up with some other code at the moment, I'll give a
more thorough combing over tomorrow.

Paul

On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker <no...@gmail.com> wrote:
> Hi,
>
> thanks to Paul's excellent suggestion I have rewritten the multiview
> to use bloom filters, I had a concern that a bloom filter per view
> would use too much memory but thanks in the main to excellent
> implementation of bloom filters in erlang
> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
> very space efficient.
>
> New code is here
>
> http://github.com/normanb/couchdb/
>
> The code is simple, all one process, once we have agreed the approach
> we can decide if there is any benefit in making the bloom filter
> generation occur a separate process (using a genserver).
>
> Comments as always appreciated, I will continue adding to the test suite.
>
> thanks for the help,
>
> Norman
>

Re: multiview using bloom filters

Posted by Filipe David Manana <fd...@apache.org>.
Hi, good work.

I haven't had the time to look at it in detail, only glanced at it.
My comments:

- use indentation of 4 spaces - I see some places using indentation of
2 spaces, others using more, etc;
- most of CouchDB's code uses a different style of case expression
indentation - each case branch as the same indentation as the case
line, e.g.:

case lists:member(foobar, List) of
true ->
    whatever;
false ->
    else;
end

This avoids the code to grow too much horizontally (and more readable
in my opinion, but this is very subjective).

- keep lines to 80 characters at most
- instead of using list_to_binary and binary_to_list, you can use the
?l2b and ?b2l macros defined in couch_db.hrl, they help reducing line
length

cheers

On Sat, Sep 25, 2010 at 12:54 AM, Norman Barker <no...@gmail.com> wrote:
> Hi,
>
> thanks to Paul's excellent suggestion I have rewritten the multiview
> to use bloom filters, I had a concern that a bloom filter per view
> would use too much memory but thanks in the main to excellent
> implementation of bloom filters in erlang
> (http://sites.google.com/site/scalablebloomfilters/) they seem to be
> very space efficient.
>
> New code is here
>
> http://github.com/normanb/couchdb/
>
> The code is simple, all one process, once we have agreed the approach
> we can decide if there is any benefit in making the bloom filter
> generation occur a separate process (using a genserver).
>
> Comments as always appreciated, I will continue adding to the test suite.
>
> thanks for the help,
>
> Norman
>



-- 
Filipe David Manana,
fdmanana@gmail.com, fdmanana@apache.org

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."