You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2009/08/21 02:52:14 UTC

[jira] Updated: (SOLR-1375) BloomFilter on a field

     [ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated SOLR-1375:
-----------------------------------

    Attachment: SOLR-1375.patch

Patch file

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster (which when using term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.