You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2014/07/28 13:17:53 UTC

Bloom filter

Hi

Where can I find documentation on how to use Bloom filters in Solr 
(4.4). http://wiki.apache.org/solr/BloomIndexComponent seems to be 
outdated - there is no BloomIndexComponent included in 4.4 code.

Regards, Per Steffensen

Re: Bloom filter

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
I don't think that issue was ever committed.


On Mon, Jul 28, 2014 at 4:47 PM, Per Steffensen <st...@designware.dk> wrote:

> Hi
>
> Where can I find documentation on how to use Bloom filters in Solr (4.4).
> http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
> there is no BloomIndexComponent included in 4.4 code.
>
> Regards, Per Steffensen
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Bloom filter

Posted by Per Steffensen <st...@designware.dk>.
I just finished adding support for persisted ("backed" as I call them) 
bloom-filters in Guava Bloom Filter. Implemented one kind of persisted 
bloom-filter that works on memory mapped files.
I have changed our Solr code so that it uses such a enhanced Guava Bloom 
Filter. Making sure it is kept up to date and using it when quick "does 
definitely not exist checks" will help performance.

We do duplicate check also, because we also might get the "same" data 
from our external provider numerous times. We do it using unique-id 
feature in Solr where we make sure that if and only if (in practice) a 
two documents are "the same" they have the same id. We encode most info 
on the document in its id - including hashes of textual fields. Works 
like a charm. It is exactly in this case we want to improve performance. 
Most of the time a document does not already exist when we do this 
duplicate check (using the unique-id feature), but it just takes 
relatively long time to verify it, because you have to visit the index. 
We can get a quick "document with this id does not exist" using 
bloom-filter on id.

Regards, Per Steffensen

On 03/08/14 03:58, Umesh Prasad wrote:
> +1 to Guava's BloomFilter implementation.
>
> You can actually hook into UpdateProcessor chain and have the logic of
> updating bloom filter / checking there.
>
> We had a somewhat similar use case.  We were using DIH and it was possible
> that same solr input document (meaning same content) will be coming lots of
> times and it was leading to a lot of unnecessary updates in index. I
> introduced a DuplicateDetector using update processor chain which kept a
> map of Unique ID --> solr doc hash code and will drop the document if it
> was a duplicate.
>
> There is a nice video of other usage of Update chain
>
> https://www.youtube.com/watch?v=qoq2QEPHefo


Re: Bloom filter

Posted by Umesh Prasad <um...@gmail.com>.
+1 to Guava's BloomFilter implementation.

You can actually hook into UpdateProcessor chain and have the logic of
updating bloom filter / checking there.

We had a somewhat similar use case.  We were using DIH and it was possible
that same solr input document (meaning same content) will be coming lots of
times and it was leading to a lot of unnecessary updates in index. I
introduced a DuplicateDetector using update processor chain which kept a
map of Unique ID --> solr doc hash code and will drop the document if it
was a duplicate.

There is a nice video of other usage of Update chain

https://www.youtube.com/watch?v=qoq2QEPHefo






On 30 July 2014 23:05, Shalin Shekhar Mangar <sh...@gmail.com> wrote:

> You're right. I misunderstood. I thought that you wanted to optimize the
> "finding by id" path which is typically done for comparing versions during
> inserts in Solr.
>
> Yes, it won't help with the case where the ID does not exist.
>
>
> On Wed, Jul 30, 2014 at 6:14 PM, Per Steffensen <st...@designware.dk>
> wrote:
>
> > Hi
> >
> > I am not sure exactly what LUCENE-5675 does, but reading the description
> > it seems to me that it would help finding out that there is no document
> > (having an id-field) where version-field is less than <some-version>. As
> > far as I can see this will not help finding out if a document with
> > id=<some-id> exists. We want to ask "does a document with id <some-id>
> > exist", without knowing the value of its version-field (if it actually
> > exists). You do not know if it ever existed, either.
> >
> > Please elaborate. Thanks!
> >
> > Regarding " The only other choice today is bloom filters, which use up
> > huge amounts of memory", I guess a bloom filter only takes as much space
> > (disk or memory) as you want it to. The more space you allow it to use
> the
> > more it gives you a false positive (saying "this doc might exist" in
> cases
> > where the doc actually does not exist). So the space you need to use for
> > the bloom filter depends on how frequently you can live with false
> > positives (where you have to actually look it up in the real index).
> >
> > Regards, Per Steffensen
> >
> >
> > On 30/07/14 10:05, Shalin Shekhar Mangar wrote:
> >
> >> Hi Per,
> >>
> >> There's LUCENE-5675 which has added a new postings format for IDs.
> Trying
> >> it out in Solr is in my todo list but maybe you can get to it before me.
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-5675
> >>
> >>
> >> On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen <st...@designware.dk>
> >> wrote:
> >>
> >>  On 30/07/14 08:55, jim ferenczi wrote:
> >>>
> >>>  Hi Per,
> >>>> First of all the BloomFilter implementation in Lucene is not exactly a
> >>>> bloom filter. It uses only one hash function and you cannot set the
> >>>> false
> >>>> positive ratio beforehand. ElasticSearch has its own bloom filter
> >>>> implementation (using "guava like" BloomFilter), you should take a
> look
> >>>> at
> >>>> their implementation if you really need this feature.
> >>>>
> >>>>  Yes, I am looking into what Lucene can do and how to use it through
> >>> Solr.
> >>> If it does not fit our needs I will enhance it - potentially with
> >>> inspiration from ES implementation. Thanks
> >>>
> >>>   What is your use-case ? If your index fits in RAM the bloom filter
> >>> won't
> >>>
> >>>> help (and it may have a negative impact if you have a lot of
> segments).
> >>>> In
> >>>> fact the only use case where the bloom filter can help is when your
> term
> >>>> dictionary does not fit in RAM which is rarely the case.
> >>>>
> >>>>  We have so many documents that it will never fit in memory. We use
> >>> optimistic locking (our own implementation) to do correct concurrent
> >>> assembly of documents and to do duplicate control. This require a lot
> of
> >>> finding docs from their id, and most of the time the document is not
> >>> there,
> >>> but to be sure we need to check both transactionlog and the actual
> index
> >>> (UpdateLog). We would like to use Bloom Filter to quickly tell that a
> >>> document with a particular id is NOT present.
> >>>
> >>>  Regards,
> >>>> Jim
> >>>>
> >>>>  Regards, Per Steffensen
> >>>
> >>>
> >>
> >>
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
---
Thanks & Regards
Umesh Prasad

Re: Bloom filter

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
You're right. I misunderstood. I thought that you wanted to optimize the
"finding by id" path which is typically done for comparing versions during
inserts in Solr.

Yes, it won't help with the case where the ID does not exist.


On Wed, Jul 30, 2014 at 6:14 PM, Per Steffensen <st...@designware.dk> wrote:

> Hi
>
> I am not sure exactly what LUCENE-5675 does, but reading the description
> it seems to me that it would help finding out that there is no document
> (having an id-field) where version-field is less than <some-version>. As
> far as I can see this will not help finding out if a document with
> id=<some-id> exists. We want to ask "does a document with id <some-id>
> exist", without knowing the value of its version-field (if it actually
> exists). You do not know if it ever existed, either.
>
> Please elaborate. Thanks!
>
> Regarding " The only other choice today is bloom filters, which use up
> huge amounts of memory", I guess a bloom filter only takes as much space
> (disk or memory) as you want it to. The more space you allow it to use the
> more it gives you a false positive (saying "this doc might exist" in cases
> where the doc actually does not exist). So the space you need to use for
> the bloom filter depends on how frequently you can live with false
> positives (where you have to actually look it up in the real index).
>
> Regards, Per Steffensen
>
>
> On 30/07/14 10:05, Shalin Shekhar Mangar wrote:
>
>> Hi Per,
>>
>> There's LUCENE-5675 which has added a new postings format for IDs. Trying
>> it out in Solr is in my todo list but maybe you can get to it before me.
>>
>> https://issues.apache.org/jira/browse/LUCENE-5675
>>
>>
>> On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen <st...@designware.dk>
>> wrote:
>>
>>  On 30/07/14 08:55, jim ferenczi wrote:
>>>
>>>  Hi Per,
>>>> First of all the BloomFilter implementation in Lucene is not exactly a
>>>> bloom filter. It uses only one hash function and you cannot set the
>>>> false
>>>> positive ratio beforehand. ElasticSearch has its own bloom filter
>>>> implementation (using "guava like" BloomFilter), you should take a look
>>>> at
>>>> their implementation if you really need this feature.
>>>>
>>>>  Yes, I am looking into what Lucene can do and how to use it through
>>> Solr.
>>> If it does not fit our needs I will enhance it - potentially with
>>> inspiration from ES implementation. Thanks
>>>
>>>   What is your use-case ? If your index fits in RAM the bloom filter
>>> won't
>>>
>>>> help (and it may have a negative impact if you have a lot of segments).
>>>> In
>>>> fact the only use case where the bloom filter can help is when your term
>>>> dictionary does not fit in RAM which is rarely the case.
>>>>
>>>>  We have so many documents that it will never fit in memory. We use
>>> optimistic locking (our own implementation) to do correct concurrent
>>> assembly of documents and to do duplicate control. This require a lot of
>>> finding docs from their id, and most of the time the document is not
>>> there,
>>> but to be sure we need to check both transactionlog and the actual index
>>> (UpdateLog). We would like to use Bloom Filter to quickly tell that a
>>> document with a particular id is NOT present.
>>>
>>>  Regards,
>>>> Jim
>>>>
>>>>  Regards, Per Steffensen
>>>
>>>
>>
>>
>


-- 
Regards,
Shalin Shekhar Mangar.

Re: Bloom filter

Posted by Per Steffensen <st...@designware.dk>.
Hi

I am not sure exactly what LUCENE-5675 does, but reading the description 
it seems to me that it would help finding out that there is no document 
(having an id-field) where version-field is less than <some-version>. As 
far as I can see this will not help finding out if a document with 
id=<some-id> exists. We want to ask "does a document with id <some-id> 
exist", without knowing the value of its version-field (if it actually 
exists). You do not know if it ever existed, either.

Please elaborate. Thanks!

Regarding " The only other choice today is bloom filters, which use up 
huge amounts of memory", I guess a bloom filter only takes as much space 
(disk or memory) as you want it to. The more space you allow it to use 
the more it gives you a false positive (saying "this doc might exist" in 
cases where the doc actually does not exist). So the space you need to 
use for the bloom filter depends on how frequently you can live with 
false positives (where you have to actually look it up in the real index).

Regards, Per Steffensen

On 30/07/14 10:05, Shalin Shekhar Mangar wrote:
> Hi Per,
>
> There's LUCENE-5675 which has added a new postings format for IDs. Trying
> it out in Solr is in my todo list but maybe you can get to it before me.
>
> https://issues.apache.org/jira/browse/LUCENE-5675
>
>
> On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen <st...@designware.dk>
> wrote:
>
>> On 30/07/14 08:55, jim ferenczi wrote:
>>
>>> Hi Per,
>>> First of all the BloomFilter implementation in Lucene is not exactly a
>>> bloom filter. It uses only one hash function and you cannot set the false
>>> positive ratio beforehand. ElasticSearch has its own bloom filter
>>> implementation (using "guava like" BloomFilter), you should take a look at
>>> their implementation if you really need this feature.
>>>
>> Yes, I am looking into what Lucene can do and how to use it through Solr.
>> If it does not fit our needs I will enhance it - potentially with
>> inspiration from ES implementation. Thanks
>>
>>   What is your use-case ? If your index fits in RAM the bloom filter won't
>>> help (and it may have a negative impact if you have a lot of segments). In
>>> fact the only use case where the bloom filter can help is when your term
>>> dictionary does not fit in RAM which is rarely the case.
>>>
>> We have so many documents that it will never fit in memory. We use
>> optimistic locking (our own implementation) to do correct concurrent
>> assembly of documents and to do duplicate control. This require a lot of
>> finding docs from their id, and most of the time the document is not there,
>> but to be sure we need to check both transactionlog and the actual index
>> (UpdateLog). We would like to use Bloom Filter to quickly tell that a
>> document with a particular id is NOT present.
>>
>>> Regards,
>>> Jim
>>>
>> Regards, Per Steffensen
>>
>
>


Re: Bloom filter

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
I opened https://issues.apache.org/jira/browse/SOLR-6301


On Wed, Jul 30, 2014 at 1:35 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Hi Per,
>
> There's LUCENE-5675 which has added a new postings format for IDs. Trying
> it out in Solr is in my todo list but maybe you can get to it before me.
>
> https://issues.apache.org/jira/browse/LUCENE-5675
>
>
> On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen <st...@designware.dk>
> wrote:
>
>> On 30/07/14 08:55, jim ferenczi wrote:
>>
>>> Hi Per,
>>> First of all the BloomFilter implementation in Lucene is not exactly a
>>> bloom filter. It uses only one hash function and you cannot set the false
>>> positive ratio beforehand. ElasticSearch has its own bloom filter
>>> implementation (using "guava like" BloomFilter), you should take a look
>>> at
>>> their implementation if you really need this feature.
>>>
>> Yes, I am looking into what Lucene can do and how to use it through Solr.
>> If it does not fit our needs I will enhance it - potentially with
>> inspiration from ES implementation. Thanks
>>
>>  What is your use-case ? If your index fits in RAM the bloom filter won't
>>> help (and it may have a negative impact if you have a lot of segments).
>>> In
>>> fact the only use case where the bloom filter can help is when your term
>>> dictionary does not fit in RAM which is rarely the case.
>>>
>> We have so many documents that it will never fit in memory. We use
>> optimistic locking (our own implementation) to do correct concurrent
>> assembly of documents and to do duplicate control. This require a lot of
>> finding docs from their id, and most of the time the document is not there,
>> but to be sure we need to check both transactionlog and the actual index
>> (UpdateLog). We would like to use Bloom Filter to quickly tell that a
>> document with a particular id is NOT present.
>>
>>>
>>> Regards,
>>> Jim
>>>
>> Regards, Per Steffensen
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Bloom filter

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Hi Per,

There's LUCENE-5675 which has added a new postings format for IDs. Trying
it out in Solr is in my todo list but maybe you can get to it before me.

https://issues.apache.org/jira/browse/LUCENE-5675


On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen <st...@designware.dk>
wrote:

> On 30/07/14 08:55, jim ferenczi wrote:
>
>> Hi Per,
>> First of all the BloomFilter implementation in Lucene is not exactly a
>> bloom filter. It uses only one hash function and you cannot set the false
>> positive ratio beforehand. ElasticSearch has its own bloom filter
>> implementation (using "guava like" BloomFilter), you should take a look at
>> their implementation if you really need this feature.
>>
> Yes, I am looking into what Lucene can do and how to use it through Solr.
> If it does not fit our needs I will enhance it - potentially with
> inspiration from ES implementation. Thanks
>
>  What is your use-case ? If your index fits in RAM the bloom filter won't
>> help (and it may have a negative impact if you have a lot of segments). In
>> fact the only use case where the bloom filter can help is when your term
>> dictionary does not fit in RAM which is rarely the case.
>>
> We have so many documents that it will never fit in memory. We use
> optimistic locking (our own implementation) to do correct concurrent
> assembly of documents and to do duplicate control. This require a lot of
> finding docs from their id, and most of the time the document is not there,
> but to be sure we need to check both transactionlog and the actual index
> (UpdateLog). We would like to use Bloom Filter to quickly tell that a
> document with a particular id is NOT present.
>
>>
>> Regards,
>> Jim
>>
> Regards, Per Steffensen
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Bloom filter

Posted by Per Steffensen <st...@designware.dk>.
On 30/07/14 08:55, jim ferenczi wrote:
> Hi Per,
> First of all the BloomFilter implementation in Lucene is not exactly a
> bloom filter. It uses only one hash function and you cannot set the false
> positive ratio beforehand. ElasticSearch has its own bloom filter
> implementation (using "guava like" BloomFilter), you should take a look at
> their implementation if you really need this feature.
Yes, I am looking into what Lucene can do and how to use it through 
Solr. If it does not fit our needs I will enhance it - potentially with 
inspiration from ES implementation. Thanks
> What is your use-case ? If your index fits in RAM the bloom filter won't
> help (and it may have a negative impact if you have a lot of segments). In
> fact the only use case where the bloom filter can help is when your term
> dictionary does not fit in RAM which is rarely the case.
We have so many documents that it will never fit in memory. We use 
optimistic locking (our own implementation) to do correct concurrent 
assembly of documents and to do duplicate control. This require a lot of 
finding docs from their id, and most of the time the document is not 
there, but to be sure we need to check both transactionlog and the 
actual index (UpdateLog). We would like to use Bloom Filter to quickly 
tell that a document with a particular id is NOT present.
>
> Regards,
> Jim
Regards, Per Steffensen

Re: Bloom filter

Posted by jim ferenczi <ji...@gmail.com>.
Hi Per,
First of all the BloomFilter implementation in Lucene is not exactly a
bloom filter. It uses only one hash function and you cannot set the false
positive ratio beforehand. ElasticSearch has its own bloom filter
implementation (using "guava like" BloomFilter), you should take a look at
their implementation if you really need this feature.
What is your use-case ? If your index fits in RAM the bloom filter won't
help (and it may have a negative impact if you have a lot of segments). In
fact the only use case where the bloom filter can help is when your term
dictionary does not fit in RAM which is rarely the case.

Regards,
Jim



2014-07-28 16:13 GMT+02:00 Per Steffensen <st...@designware.dk>:

> Yes I found that one, along with SOLR-3950. Well at least it seems like
> the support is there in Lucene. I will figure out myself how to make it
> work via Solr, the way I need it to work. My use-case is not as specified
> in SOLR-1375, but the solution might be the same. Any input is of course
> still very much appreciated.
>
> Regards, Per Steffensen
>
>
> On 28/07/14 15:42, Lukas Drbal wrote:
>
>> Hi Per,
>>
>> link to jira - https://issues.apache.org/jira/browse/SOLR-1375 Unresolved
>> ;-)
>>
>> L.
>>
>>
>> On Mon, Jul 28, 2014 at 1:17 PM, Per Steffensen <st...@designware.dk>
>> wrote:
>>
>>  Hi
>>>
>>> Where can I find documentation on how to use Bloom filters in Solr (4.4).
>>> http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
>>> there is no BloomIndexComponent included in 4.4 code.
>>>
>>> Regards, Per Steffensen
>>>
>>>
>>
>>
>

Re: Bloom filter

Posted by Per Steffensen <st...@designware.dk>.
Yes I found that one, along with SOLR-3950. Well at least it seems like 
the support is there in Lucene. I will figure out myself how to make it 
work via Solr, the way I need it to work. My use-case is not as 
specified in SOLR-1375, but the solution might be the same. Any input is 
of course still very much appreciated.

Regards, Per Steffensen

On 28/07/14 15:42, Lukas Drbal wrote:
> Hi Per,
>
> link to jira - https://issues.apache.org/jira/browse/SOLR-1375 Unresolved
> ;-)
>
> L.
>
>
> On Mon, Jul 28, 2014 at 1:17 PM, Per Steffensen <st...@designware.dk> wrote:
>
>> Hi
>>
>> Where can I find documentation on how to use Bloom filters in Solr (4.4).
>> http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
>> there is no BloomIndexComponent included in 4.4 code.
>>
>> Regards, Per Steffensen
>>
>
>


Re: Bloom filter

Posted by Lukas Drbal <lu...@socialbakers.com>.
Hi Per,

link to jira - https://issues.apache.org/jira/browse/SOLR-1375 Unresolved
;-)

L.


On Mon, Jul 28, 2014 at 1:17 PM, Per Steffensen <st...@designware.dk> wrote:

> Hi
>
> Where can I find documentation on how to use Bloom filters in Solr (4.4).
> http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
> there is no BloomIndexComponent included in 4.4 code.
>
> Regards, Per Steffensen
>



-- 


*Lukáš Drbal*
Software architect

*Socialbakers*
Facebook applications and other sweet stuff

Facebook Preferred Marketing Developer


*+420 739 815 424**lukas.drbal@socialbakers.com
<lu...@socialbakers.com>*
*www.socialbakers.com <http://www.socialbakers.com/>*