You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2009/08/21 02:50:14 UTC

[jira] Created: (SOLR-1375) BloomFilter on a field

BloomFilter on a field
----------------------

Key: SOLR-1375
URL: https://issues.apache.org/jira/browse/SOLR-1375
Project: Solr
Issue Type: New Feature
Components: update
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 1.5

* A bloom filter is a read only probabilistic set. Its useful
for verifying a key exists in a set, though it returns false
positives. http://en.wikipedia.org/wiki/Bloom_filter

* The use case is indexing in Hadoop and checking for duplicates
against a Solr cluster (which when using term dictionary or a
query) is too slow and exceeds the time consumed for indexing.
When a match is found, the host, segment, and term are returned.
If the same term is found on multiple servers, multiple results
are returned by the distributed process. (We'll need to add in
the core name I just realized).

* When new segments are created, and commit is called, a new
bloom filter is generated from a given field (default:id) by
iterating over the term dictionary values. There's a bloom
filter file per segment, which is managed on each Solr shard.
When segments are merged away, their corresponding .blm files is
also removed. In a future version we'll have a central server
for the bloom filters so we're not abusing the thread pool of
the Solr proxy and the networking of the Solr cluster (this will
be done sooner than later after testing this version). I held
off because the central server requires syncing the Solr
servers' files (which is like reverse replication).

* The patch uses the BloomFilter from Hadoop 0.20. I want to jar
up only the necessary classes so we don't have a giant Hadoop
jar in lib.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html

* Distributed code is added and seems to work, I extended
TestDistributedSearch to test over multiple HTTP servers. I
chose this approach rather than the manual method used by (for
example) TermVectorComponent.testDistributed because I'm new to
Solr's distributed search and wanted to learn how it works (the
stages are confusing). Using this method, I didn't need to setup
multiple tomcat servers and manually execute tests.

* We need more of the bloom filter options passable via
solrconfig

* I'll add more test cases

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846139#action_12846139 ] 

Otis Gospodnetic commented on SOLR-1375:
----------------------------------------

Heh, with the Lucene/Solr merge taking place now, my previous comment above makes even more sense.  What do you think?

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747734#action_12747734 ] 

Jason Rutherglen commented on SOLR-1375:
----------------------------------------

The other attribute to add is the ability to set the hash function to use (i.e. Murmur)

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746591#action_12746591 ] 

Andrzej Bialecki  commented on SOLR-1375:
-----------------------------------------

See here for a Java impl. of FastBits: http://code.google.com/p/compressedbitset/ .

Re: BloomFilters - in BloomIndexComponent you seem to assume that when BloomKeySet.contains(key) returns true then a key exists in a set. This is not strictly speaking true. You can only be sure with 1.0 probability that a key does NOT exist in a set, for other key when the result is true you only have a (1.0 - eps) probability that the answer is correct, i.e. the BloomFilter will return a false positive result for non-existent keys, with (eps) probability. You should take this into account when writing client code.

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745800#action_12745800 ] 

Jason Rutherglen commented on SOLR-1375:
----------------------------------------

Lance,

Thanks for the info. BloomFilters can be ORed together. Hadoop
uses BFs for map side joins, which is similar to this use case.

Centralizing will help when performing millions of id membership
tests, though I'm going to benchmark first and see if the
current patch is good enough.

-J

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1375) BloomFilter on a field

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745774#action_12745774 ] 

Lance Norskog edited comment on SOLR-1375 at 8/20/09 8:04 PM:
--------------------------------------------------------------

At my previous job, we were attempting to add the same document up to 100x per day. We used an MD5 signature for the id and made a bitmap file to pre-check ids before attempting to add them. Because we did not created a bitmap file with 2^32 bits (512M) instead of (2^128) we also had a false positive problem which we were willing to put up with. (It was ok if we did not add all documents. )

We also had the advantage that different feeding machines pulled documents from different sources, and so machine A's set of repeated documents was separate from machine B's. Therefore, each could keep its own bitmap file and the files could be OR'd together periodically in background. 

I can't recommend what we did. If you like the Bloom Filter for this problem, that's great. 

This project: [FastBits IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html] claims to be super-smart about compressing bits in a disk archive. It might be a better technology than the Nutch Bloom Filter, but who cares.






      was (Author: lancenorskog):
    At my previous job, we were attempting to add the same document up to 100x per day. We used an MD5 signature for the id and made a bitmap file to pre-check ids before attempting to add them. Because we did not created a bitmap file with 2^32 bits (512M) instead of (2^128) we also had a false positive problem which we were willing to put up with. (It was ok if we did not add all documents. )
We also had the advantage that different feeding machines pulled documents from different sources, and so machine A's set of repeated documents was separate from machine B's. Therefore, each could keep its own bitmap file and the files could be OR'd together periodically in background. 
I can't recommend what we did. If you like the Bloom Filter for this problem, that's great. 
This project: [FastBits IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html] claims to be super-smart about compressing bits in a disk archive. It might be a better technology than the Nutch Bloom Filter, but who cares.





  
> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846166#action_12846166 ] 

Ted Dunning commented on SOLR-1375:
-----------------------------------

Sorry to comment late here, but when indexing in hadoop, it is really nice to avoid any central dependence.  It is also nice to focus the map-side join on items likely to match.  Thirdly, reduce side indexing is typically really important.

The conclusions from these three considerations vary by duplication rate.  Using reduce-side indexing gets rid of most of the problems of duplicate versions of a single document (with the same sort key) since the reducer can scan to see whether it has the final copy handy before adding a document to the index.

There remain problems where we have to not index documents that already exist in the index or to generate a deletion list that can assist in applying the index update.  The former problem is usually the more severe one because it isn't unusual for data sources to just include a full dump of all documents and assume that the consumer will figure out which are new or updated.  Here you would like to only index new and modified documents.  

My own preference for this is to avoid the complication of the map-side join using Bloom filters and simply export a very simple list of stub documents that correspond to the documents in the index.  These stub documents should be much smaller than the average document (unless you are indexing tweets) which makes passing around great masses of stub documents not such a problem since Hadoop shuffle, copy and sort times are all dominated by Lucene index times.  Passing stub documents allows the reducer to simply iterate through all documents with the same key keeping the latest version or any stub that is encountered.  For documents without a stub, normal indexing can be done with the slight addition exporting a list of stub documents for the new additions.

The same thing could be done with a map-side join, but the trade-off is that you now need considerably more memory for the mapper to store the entire bitmap in memory as opposed needing (somewhat) more time to pass the stub documents around.  How that trade-off plays out in the real world isn't clear.  My personal preference is to keep heap space small since the time cost is pretty minimal for me.

This problem also turns up in our PDF conversion pipeline where we keep check-sums of each PDF that has already been converted to viewable forms.   In that case, the ratio of real document size to stub size is even more preponderate.


> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1375) BloomFilter on a field

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745774#action_12745774 ] 

Lance Norskog edited comment on SOLR-1375 at 8/20/09 8:08 PM:
--------------------------------------------------------------

At my previous job, we were attempting to add the same document up to 100x per day. We used an MD5 signature for the id and made a bitmap file to pre-check ids before attempting to add them. Because we did not created a bitmap file with 2^32 bits (512M) instead of (2^128) we also had a false positive problem which we were willing to put up with. (It was ok if we did not add all documents. )

We also had the advantage that different feeding machines pulled documents from different sources, and so machine A's set of repeated documents was separate from machine B's. Therefore, each could keep its own bitmap file and the files could be OR'd together periodically in background. 

I can't recommend what we did. If you like the Bloom Filter for this problem, that's great. 

This project: [FastBits IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html] claims to be super-smart about compressing bits in a disk archive. It might be a better technology than the Nutch Bloom Filter, but there is no Java and the C is a different license.

I would counsel against making a central server ; Solr technologies should be distributed and localized (close to the Solr instance) as possible.






      was (Author: lancenorskog):
    At my previous job, we were attempting to add the same document up to 100x per day. We used an MD5 signature for the id and made a bitmap file to pre-check ids before attempting to add them. Because we did not created a bitmap file with 2^32 bits (512M) instead of (2^128) we also had a false positive problem which we were willing to put up with. (It was ok if we did not add all documents. )

We also had the advantage that different feeding machines pulled documents from different sources, and so machine A's set of repeated documents was separate from machine B's. Therefore, each could keep its own bitmap file and the files could be OR'd together periodically in background. 

I can't recommend what we did. If you like the Bloom Filter for this problem, that's great. 

This project: [FastBits IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html] claims to be super-smart about compressing bits in a disk archive. It might be a better technology than the Nutch Bloom Filter, but who cares.





  
> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851637#action_12851637 ] 

Jason Rutherglen commented on SOLR-1375:
----------------------------------------

{quote}Doesn't this hint at some of this stuff (haven't looked at the patch) really needing to live in Lucene index segment files merging land?{quote}

Adding this to Lucene is out of the scope of what I require, however I don't have time unless it's going to be committed.

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1375) BloomFilter on a field

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated SOLR-1375:
-----------------------------------

    Attachment: SOLR-1375.patch

* Removes the need to set the shards.qt in solrconfig.xml.  

* The test case verifies the correct .blm files are created

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1375) BloomFilter on a field

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated SOLR-1375:
-----------------------------------

    Attachment: SOLR-1375.patch

Patch file

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745774#action_12745774 ] 

Lance Norskog commented on SOLR-1375:
-------------------------------------

At my previous job, we were attempting to add the same document up to 100x per day. We used an MD5 signature for the id and made a bitmap file to pre-check ids before attempting to add them. Because we did not created a bitmap file with 2^32 bits (512M) instead of (2^128) we also had a false positive problem which we were willing to put up with. (It was ok if we did not add all documents. )
We also had the advantage that different feeding machines pulled documents from different sources, and so machine A's set of repeated documents was separate from machine B's. Therefore, each could keep its own bitmap file and the files could be OR'd together periodically in background. 
I can't recommend what we did. If you like the Bloom Filter for this problem, that's great. 
This project: [FastBits IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html] claims to be super-smart about compressing bits in a disk archive. It might be a better technology than the Nutch Bloom Filter, but who cares.






> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1375) BloomFilter on a field

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838446#action_12838446 ] 

Otis Gospodnetic commented on SOLR-1375:
----------------------------------------

{quote}
When new segments are created, and commit is called, a new
bloom filter is generated from a given field (default:id) by
iterating over the term dictionary values. There's a bloom
filter file per segment, which is managed on each Solr shard.
When segments are merged away, their corresponding .blm files is
also removed. 
{quote}

Doesn't this hint at some of this stuff (haven't looked at the patch) really needing to live in Lucene index segment files merging land?


> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1375) BloomFilter on a field

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated SOLR-1375:
-----------------------------------

    Attachment: SOLR-1375.patch

Patch which allows shards to be set in defaults, and not cause an infinite loop if the shard is the same as the host.

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.