Posted to java-user@lucene.apache.org by Samarendra Pratap <sa...@gmail.com> on 2010/11/03 11:05:59 UTC

Single filter instance with different searchers

Hi. We have a large index (~28 GB) which is split across three different
directories, each representing a country. Each of these country-wise indexes
is further split, on the basis of last update date, into 21 smaller
indexes. The index is updated once a day.

A user can search within any one country and can choose a last update date
plus some other criteria.

When the server application starts, index readers (and hence searchers) are
created for each of the small indexes (21 x 3) and put in an array.
Depending on the options (country and last update date) chosen by the user,
we pick the searchers for the matching country/date range and create a new
ParallelMultiSearcher instance.

Now my question is: can I use a single filter (caching filter) instance for
every search (possibly on different searchers)?
===================================================================================

E.g., for the first search I create a filter for experience = 4 years and save it.

If another search, for a different country (and hence a different index), has
the same experience criterion, i.e. 4 years, can I use the same filter
instance for the second search too?

I have tested this a little and, surprisingly, I have got correct results.
I was wondering if this is the correct way, or do I need to create a
different filter for each searcher (or index reader) instance?
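
To make the question concrete, here is a rough sketch of what I mean (the
"experience" field is only an illustration, and "searchers" stands for the
pre-opened IndexSearchers we pick for one country/date-range combination):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class FilterReuseSketch {
    // one cached filter instance, created for the first search and kept
    static final Filter EXPERIENCE_4 = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("experience", "4"))));

    static TopDocs searchCountry(Searchable[] searchers, Query q) throws IOException {
        Searcher s = new ParallelMultiSearcher(searchers);
        // the same Filter instance is passed in, whichever searchers were chosen
        return s.search(q, EXPERIENCE_4, 100);
    }
}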

Thanks in advance.

-- 
Regards,
Samar

Re: Single filter instance with different searchers

Posted by Samarendra Pratap <sa...@gmail.com>.
Thanks, Erick, for your insight.

I'd appreciate it if someone could throw more light on this.

Thanks




-- 
Regards,
Samar

Re: Single filter instance with different searchers

Posted by Erick Erickson <er...@gmail.com>.
I'm going to have to leave answering that to people with more
familiarity with the underlying code than I have...

That said, I'd #guess# that you'll be OK because I'd #guess# that
filters are maintained on a per-reader basis and the results
are synthesized when combined in a MultiSearcher.

But that's all a guess....
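
If that guess is right, the mechanism would be roughly the following (a
sketch of the idea only, assuming Lucene 2.9-era APIs; this is not Lucene's
actual CachingWrapperFilter source):

import java.io.IOException;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

// Cache one DocIdSet per IndexReader, so the same Filter instance can be
// handed to any searcher combination: each underlying reader gets (and
// reuses) its own bit set.
public class PerReaderCachingFilter extends Filter {
    private final Filter wrapped;
    private final Map<IndexReader, DocIdSet> cache =
            new WeakHashMap<IndexReader, DocIdSet>();

    public PerReaderCachingFilter(Filter wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public synchronized DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        DocIdSet docs = cache.get(reader);
        if (docs == null) {
            docs = wrapped.getDocIdSet(reader); // computed once per reader
            cache.put(reader, docs);
        }
        return docs;
    }
}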

Best
Erick


Re: Single filter instance with different searchers

Posted by Samarendra Pratap <sa...@gmail.com>.
Thanks, Erick, you cleared up some of my confusion, but I still have a doubt.

As you can see in the previous example code, I am re-creating the
ParallelMultiSearcher for each search. (This is the actual scenario on our
production servers.) The ParallelMultiSearcher constructor takes a different
combination of searchers each time, which means the same document may be
assigned a different docid in the next search.

So my primary question is: will the cached results from a filter created
with one multi-searcher work correctly with another multi-searcher? (The
underlying IndexSearchers are opened only once; it is the combination of
IndexSearchers that varies from search to search.)

I have tested it with my real code and sample indexes, and the results
appear to be correct, but given the confusion above I am not able to
understand how.

One more question, out of curiosity: which option would be more efficient?
1. MultiSearchers (either re-created for each search or reused from a cache)
over different combinations of searchers, or 2. a single index for all
last-update-date criteria, using filters for the different combinations of
last update dates?
As I wrote in my previous mail, we have different physical indexes based on
different ranges of update dates, and we select the appropriate indexes
based on the user-selected options.
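
For option 2, I imagine the per-date filters would look roughly like this
(only a sketch; the "lastupdate" field and the yyyyMMdd term format are
made-up examples, not our real schema):

import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.TermRangeFilter;

public class DateRangeFilters {
    // one cached filter per last-update-date range over the single big index
    public static Filter lastUpdatedBetween(String fromDay, String toDay) {
        return new CachingWrapperFilter(
                new TermRangeFilter("lastupdate", fromDay, toDay, true, true));
    }
}

// usage: searcher.search(query, lastUpdatedBetween("20101001", "20101103"), 1000);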

On Tue, Nov 9, 2010 at 4:25 AM, Erick Erickson <er...@gmail.com>wrote:

> Ignore my previous, I thought you were constructing your own filters. What
> you're doing should
> be OK.
>
> Here's the source of my confusion.  Each of your indexes has Lucene
> document
> IDs starting at
> 0. In your example, you have two docs/index. So, if you created a Filter
> via
> lower-level
> calls, it could not be applied across different indexes. See the discussion
> here:
> http://www.gossamer-threads.com/lists/lucene/java-user/106376. That is,
> the bit in your Filter for index0, doc0 would be the same bit as in index1,
> doc0.
>
> But, that's not what you are doing. The (Parallel)MultiSearcher takes
> care of mapping these doc IDs appropriately for you so you don't have to
> worry about
> what I was thinking about. Here's a program that illustrates this. It
> creates
> three RAMDirectories then  dumps the Lucene doc ID from each. Then it
> creates
> a multisearcher from the same three dirs and walks that, dumping the Lucene
> doc ID.
> You'll see that the doc IDs change even though the contents are the
> same....
>
> Again, though, this isn't a problem because you are using a MultiSearcher,
> which
> takes care of this for you.
>
> Which is yet another reason to never, never, never count on lucene doc IDs
> outside their context!
>
> Output at the end......
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.search.*;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Version;
>
> import java.io.IOException;
>
> import static org.apache.lucene.index.IndexWriter.*;
>
> public class EoeTest {
>    public static void main(String[] args) {
>        EoeTest eoe = new EoeTest();
>        eoe.doIt();
>    }
>    private void doIt() {
>        try {
>            populateIndexes();
>            searchAndSpit();
>            tryMulti();
>         } catch (Exception e) {
>            e.printStackTrace();
>        }
>
>    }
>
>     private Searcher getMulti() throws IOException {
>        IndexSearcher[] searchers = new IndexSearcher[3];
>        searchers[0] = new IndexSearcher(_ram1, true);
>        searchers[1] = new IndexSearcher(_ram2, true);
>        searchers[2] = new IndexSearcher(_ram3, true);
>        return new MultiSearcher(searchers);
>    }
>    private void tryMulti() throws IOException {
>        searchOne("multi ", getMulti());
>    }
>
>    private void searchAndSpit() throws IOException {
>        searchOne("ram1", new IndexSearcher(_ram1, true));
>        searchOne("ram2", new IndexSearcher(_ram2, true));
>        searchOne("ram3", new IndexSearcher(_ram3, true));
>    }
>    private void searchOne(String which, Searcher is) throws IOException {
>        log("dumping " + which);
>        TopDocs hits = is.search(new MatchAllDocsQuery(), 100);
>        for (int idx = 0; idx < hits.scoreDocs.length; ++idx) {
>            ScoreDoc sd = hits.scoreDocs[idx];
>            Document doc = is.doc(sd.doc);
>            log(String.format("lid: %d, content: %s", sd.doc,
> doc.get("content")));
>        }
>        is.close();
>    }
>    private void log(String msg) {
>        System.out.println(msg);
>    }
>    private void populateIndexes() throws IOException {
>        popOne(_ram1);
>        popOne(_ram2);
>        popOne(_ram3);
>    }
>
>    private void popOne(Directory dir) throws IOException {
>        IndexWriter iw = new IndexWriter(dir, _std, MaxFieldLength.LIMITED);
>        Document doc = new Document();
>        doc.add(new Field("content", "common " +
> Double.toString(Math.random()), Field.Store.YES, Field.Index.ANALYZED,
> Field.TermVector.YES));
>        iw.addDocument(doc);
>
>        doc = new Document();
>        doc.add(new Field("content", "common " +
> Double.toString(Math.random()), Field.Store.YES, Field.Index.ANALYZED,
> Field.TermVector.YES));
>        iw.addDocument(doc);
>
>        iw.close();
>    }
>
>
>    Directory _ram1 = new RAMDirectory();
>    Directory _ram2 = new RAMDirectory();
>    Directory _ram3 = new RAMDirectory();
>    Analyzer _std = new StandardAnalyzer(Version.LUCENE_29);
> }
>
> ************************************output****************
> where lid: ### is the Lucene doc ID returned in scoreDocs
> ***********************************************************
>
> dumping ram1
> lid: 0, content: common 0.11100571422470962
> lid: 1, content: common 0.31555863707233567
> dumping ram2
> lid: 0, content: common 0.01235509997022377
> lid: 1, content: common 0.7017712652104814
> dumping ram3
> lid: 0, content: common 0.9472403989314128
> lid: 1, content: common 0.7105628402082196
> dumping multi
> lid: 0, content: common 0.11100571422470962
> lid: 1, content: common 0.31555863707233567
> lid: 2, content: common 0.01235509997022377
> lid: 3, content: common 0.7017712652104814
> lid: 4, content: common 0.9472403989314128
> lid: 5, content: common 0.7105628402082196
>
>
>
>
> On Mon, Nov 8, 2010 at 3:33 AM, Samarendra Pratap <samarzone@gmail.com
> >wrote:
>
> > Hi Erick, Thanks for the reply.
> >  Your answer have puzzled me more because what I am able to view is not
> > what you say or I am not able to grasp your meaning.
> >  I have written a small program which is exactly what my original
> question
> > was. Here I am creating a CachingWrapperFilter on one index and reusing
> it
> > on other indexes. This single filter gives me results as expected from
> each
> > of the index. I will appreciate if you can throw some light.
> >
> > I have given the output after the program ends
> >
> >
> >
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> > // following program is compiled with java6
> >
> > import org.apache.lucene.index.*;
> > import org.apache.lucene.analysis.*;
> > import org.apache.lucene.analysis.standard.*;
> > import org.apache.lucene.search.*;
> > import org.apache.lucene.search.spans.*;
> > import org.apache.lucene.store.*;
> > import org.apache.lucene.document.*;
> > import org.apache.lucene.queryParser.*;
> > import org.apache.lucene.util.*;
> >
> > import java.util.*;
> >
> > public class FilterTest
> > {
> > protected Directory[] dirs;
> >  protected Analyzer a;
> > protected Searcher[] searchers;
> > protected QueryParser qp;
> >  protected Filter f;
> > protected Hashtable<String, Filter> filters;
> >
> >  public FilterTest()
> > {
> > // create analyzer
> >  a = new StandardAnalyzer(Version.LUCENE_29);
> > // create query parser
> > qp = new QueryParser(Version.LUCENE_29, "content", a);
> >  // initialize "filters" Hashtable
> > filters = new Hashtable<String, Filter>();
> >  }
> >
> > protected void createDirectories(int length)
> > {
> >  // create specified number of RAM directories
> > dirs = new Directory[length];
> >  for(int i=0;i<length;i++)
> > dirs[i] = new RAMDirectory();
> > }
> >
> > protected void createIndexes() throws Exception
> > {
> > /* create indexes for each directory.
> >  each index contains two documents.
> > every document contains one term, unique across all indexes, one term
> > unique across single index and one term common to all indexes
> >  */
> > for(int i=0;i<dirs.length;i++)
> > {
> >  IndexWriter iw = new IndexWriter(dirs[i], a, true,
> > IndexWriter.MaxFieldLength.LIMITED);
> >
> > Document d = new Document();
> >  // unique id across all indexes
> > d.add(new Field("id", ""+(i*2+1), Field.Store.YES,
> > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> >  // unique id in a single indexes
> > d.add(new Field("docnumber", "1", Field.Store.YES,
> > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> >  // common word in all indexes
> > d.add(new Field("content", "common", Field.Store.YES,
> Field.Index.ANALYZED,
> > Field.TermVector.YES));
> >  iw.addDocument(d);
> >
> > d = new Document();
> > // unique id across all indexes
> >  d.add(new Field("id", ""+(i*2+2), Field.Store.YES,
> > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> > // unique id in a single indexes
> >  d.add(new Field("docnumber", "2", Field.Store.YES,
> > Field.Index.NOT_ANALYZED, Field.TermVector.YES));
> > // common word in all indexes
> >  d.add(new Field("content", "common", Field.Store.YES,
> > Field.Index.ANALYZED, Field.TermVector.YES));
> > iw.addDocument(d);
> >
> > iw.close();
> > }
> > }
> >
> > protected void openSearchers() throws Exception
> > {
> > // open searches for every directory and save it in an array
> >  searchers = new Searcher[dirs.length];
> > for(int i=0;i<dirs.length;i++)
> >  searchers[i] = new IndexSearcher(IndexReader.open(dirs[i], true));
> > }
> >
> >  protected Searcher getSearcher(int[] arr) throws Exception
> > {
> > // provides a ParallelMultiSearcher instance with the searchers lying at
> > index positions provided in the argument
> >  Searcher[] s = new Searcher[arr.length];
> > for(int i=0;i<arr.length;i++)
> >  s[i] = this.searchers[arr[i]];
> >
> > return new ParallelMultiSearcher(s);
> >  }
> >
> > protected ScoreDoc[] search(String query, String filter, Searcher s)
> throws
> > Exception
> >  {
> > Filter f = null;
> > if(filter != null)
> >  {
> > if(filters.containsKey(filter))
> > {
> >  System.out.println("Reusing filter for - " + filter);
> > f = filters.get(filter);
> >  }
> > else
> > {
> >  System.out.println("Creating new filter for - " + filter);
> > f = new CachingWrapperFilter(new QueryWrapperFilter(qp.parse(filter)));
> >  filters.put(filter, f);
> > }
> > }
> >  System.out.println("Query:("+query+"), Filter:("+filter+")");
> > return s.search(qp.parse(query), f, 1000).scoreDocs;
> >  }
> >
> > public static void main(String[] args) throws Exception
> >  {
> >  FilterTest ft = new FilterTest();
> > ft.startTest();
> > }
> >
> > public void startTest()
> > {
> > try
> >  {
> > Query q;
> >
> > createDirectories(3);
> >  createIndexes();
> > openSearchers();
> > Searcher s;
> >  ScoreDoc[] sd;
> >
> > System.out.println("===================================");
> >  System.out.println("Fields of all the documents");
> > // creating searcher for all indexes
> >  s = getSearcher(new int[]{0,1,2});
> > // search all documents and their ids
> >  sd = search("+content:common", null, s);
> > for(int i=0;i<sd.length;i++)
> >  {
> > System.out.println("\tid:"+s.doc(sd[i].doc).get("id")+",
> > docnumber:"+s.doc(sd[i].doc).get("docnumber"));
> >  }
> > System.out.println("\n\n");
> >
> > System.out.println("===================================");
> >  System.out.println("Searching for documents in a single index. Filter
> > will be created and cached");
> > s = getSearcher(new int[]{0});
> >  sd = search("+content:common", "docnumber:1", s);
> > System.out.println("Hits:"+sd.length);
> >  for(int i=0;i<sd.length;i++)
> > {
> > System.out.println("\tid:"+s.doc(sd[i].doc).get("id")+",
> > docnumber:"+s.doc(sd[i].doc).get("docnumber"));
> >  }
> > System.out.println("\n\n");
> >
> > System.out.println("===================================");
> >  System.out.println("Searching for documents in a other indexes other
> than
> > previous search. Query and filter will be same. Filter will be reused");
> >  s = getSearcher(new int[]{1,2});
> > sd = search("+content:common", "docnumber:1", s);
> >  System.out.println("Hits:"+sd.length);
> > for(int i=0;i<sd.length;i++)
> >  {
> > System.out.println("\tid:"+s.doc(sd[i].doc).get("id")+",
> > docnumber:"+s.doc(sd[i].doc).get("docnumber"));
> >  }
> >
> > }
> > catch(Exception e)
> >  {
> > e.printStackTrace();
> > }
> >  }
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
> > OUTPUT:
> > [samar@myserver java]$ java FilterTest
> > ===================================
> > Fields of all the documents
> > Query:(+content:common), Filter:(null)
> >         id:1, docid:1
> >         id:2, docid:2
> >         id:3, docid:1
> >         id:4, docid:2
> >         id:5, docid:1
> >         id:6, docid:2
> >
> >
> >
> > ===================================
> > Searching for documents in a single index. Filter will be created and
> > cached
> > Creating new filter for - docid:1
> > Query:(+content:common), Filter:(docid:1)
> > Hits:1
> >         id:1, docid:1
> >
> >
> >
> > ===================================
> > Searching for documents in indexes other than previous search. Query and
> > filter will be same. Filter will be reused
> > Reusing filter for - docid:1
> > Query:(+content:common), Filter:(docid:1)
> > Hits:2
> >         id:3, docid:1
> >         id:5, docid:1
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Nov 3, 2010 at 7:04 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> [Erick's reply and the original question snipped; both appear in full elsewhere in the thread.]
> >
> >
> >
> > --
> > Regards,
> > Samar
> >
> >
> >
>



-- 
Regards,
Samar

Re: Single filter instance with different searchers

Posted by Erick Erickson <er...@gmail.com>.
Ignore my previous reply; I thought you were constructing your own filters.
What you're doing should be OK.

Here's the source of my confusion. Each of your indexes has Lucene document
IDs starting at 0, and in your example you have two docs per index. So, if
you created a Filter via lower-level calls, it could not be applied across
different indexes: the bit in your Filter for index0, doc0 would be the same
bit as in index1, doc0. See the discussion here:
http://www.gossamer-threads.com/lists/lucene/java-user/106376
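
For what it's worth, this per-index numbering is also the reason reusing a
single CachingWrapperFilter turns out to be safe: it caches one DocIdSet per
IndexReader rather than one global bit set. A rough sketch of the idea (my
own simplification, not the actual Lucene source):

import java.io.IOException;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

public class PerReaderCachingFilter extends Filter {
    private final Filter wrapped;
    // one cached entry per reader, so index0/doc0 and index1/doc0
    // never share bits
    private final Map<IndexReader, DocIdSet> cache =
            new WeakHashMap<IndexReader, DocIdSet>();

    public PerReaderCachingFilter(Filter wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public synchronized DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        DocIdSet docs = cache.get(reader);
        if (docs == null) {
            docs = wrapped.getDocIdSet(reader); // computed once per reader
            cache.put(reader, docs);
        }
        return docs;
    }
}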

But that's not what you are doing. The (Parallel)MultiSearcher takes care
of mapping these doc IDs appropriately for you, so the problem above never
arises. Here's a program that illustrates this. It creates three
RAMDirectories, then dumps the Lucene doc IDs from each. Then it creates a
MultiSearcher from the same three dirs and walks that, dumping the Lucene
doc IDs again. You'll see that the doc IDs change even though the contents
are the same.

Again, though, this isn't a problem because you are using a MultiSearcher,
which takes care of this for you.

Which is yet another reason to never, never, never count on Lucene doc IDs
outside their context!
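
If it helps to see the bookkeeping, the remapping amounts to adding a
per-searcher offset. Something like this (illustrative names only, not the
real MultiSearcher internals):

public class DocIdOffsets {
    // starts[i] = sum of maxDoc() over sub-searchers 0..i-1;
    // with two docs per index, as below, that's {0, 2, 4}
    private final int[] starts = {0, 2, 4};

    // local doc 0 of sub-index 1 becomes global doc 2, and so on
    public int toGlobal(int subSearcher, int localDoc) {
        return starts[subSearcher] + localDoc;
    }
}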

Output at the end......

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

import static org.apache.lucene.index.IndexWriter.*;

public class EoeTest {
    public static void main(String[] args) {
        EoeTest eoe = new EoeTest();
        eoe.doIt();
    }

    private void doIt() {
        try {
            populateIndexes();   // two docs in each of three RAMDirectories
            searchAndSpit();     // dump doc IDs per individual index
            tryMulti();          // dump doc IDs as seen through a MultiSearcher
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // wrap the three directories in a single MultiSearcher
    private Searcher getMulti() throws IOException {
        IndexSearcher[] searchers = new IndexSearcher[3];
        searchers[0] = new IndexSearcher(_ram1, true);
        searchers[1] = new IndexSearcher(_ram2, true);
        searchers[2] = new IndexSearcher(_ram3, true);
        return new MultiSearcher(searchers);
    }

    private void tryMulti() throws IOException {
        searchOne("multi ", getMulti());
    }

    private void searchAndSpit() throws IOException {
        searchOne("ram1", new IndexSearcher(_ram1, true));
        searchOne("ram2", new IndexSearcher(_ram2, true));
        searchOne("ram3", new IndexSearcher(_ram3, true));
    }

    // match every document and print its Lucene doc ID and content
    private void searchOne(String which, Searcher is) throws IOException {
        log("dumping " + which);
        TopDocs hits = is.search(new MatchAllDocsQuery(), 100);
        for (int idx = 0; idx < hits.scoreDocs.length; ++idx) {
            ScoreDoc sd = hits.scoreDocs[idx];
            Document doc = is.doc(sd.doc);
            log(String.format("lid: %d, content: %s", sd.doc, doc.get("content")));
        }
        is.close();
    }

    private void log(String msg) {
        System.out.println(msg);
    }

    private void populateIndexes() throws IOException {
        popOne(_ram1);
        popOne(_ram2);
        popOne(_ram3);
    }

    // add two documents that share the term "common" plus a random marker
    private void popOne(Directory dir) throws IOException {
        IndexWriter iw = new IndexWriter(dir, _std, MaxFieldLength.LIMITED);

        Document doc = new Document();
        doc.add(new Field("content", "common " + Double.toString(Math.random()),
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        iw.addDocument(doc);

        doc = new Document();
        doc.add(new Field("content", "common " + Double.toString(Math.random()),
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        iw.addDocument(doc);

        iw.close();
    }

    Directory _ram1 = new RAMDirectory();
    Directory _ram2 = new RAMDirectory();
    Directory _ram3 = new RAMDirectory();
    Analyzer _std = new StandardAnalyzer(Version.LUCENE_29);
}

************************************output****************
where lid: ### is the Lucene doc ID returned in scoreDocs
***********************************************************

dumping ram1
lid: 0, content: common 0.11100571422470962
lid: 1, content: common 0.31555863707233567
dumping ram2
lid: 0, content: common 0.01235509997022377
lid: 1, content: common 0.7017712652104814
dumping ram3
lid: 0, content: common 0.9472403989314128
lid: 1, content: common 0.7105628402082196
dumping multi
lid: 0, content: common 0.11100571422470962
lid: 1, content: common 0.31555863707233567
lid: 2, content: common 0.01235509997022377
lid: 3, content: common 0.7017712652104814
lid: 4, content: common 0.9472403989314128
lid: 5, content: common 0.7105628402082196
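
As an aside, a MultiSearcher can also map a global doc ID back to its
sub-index. If memory serves, something like this fragment (building on
getMulti() above) does it:

MultiSearcher ms = (MultiSearcher) getMulti();
int globalId = 3;                   // "lid: 3" in the multi dump above
int sub = ms.subSearcher(globalId); // 1, i.e. the second sub-searcher (_ram2)
int local = ms.subDoc(globalId);    // 1, i.e. doc 1 within that index
ms.close();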




On Mon, Nov 8, 2010 at 3:33 AM, Samarendra Pratap <sa...@gmail.com> wrote:

> [Samarendra's Nov 8 message snipped; it appears in full as the next message below.]

Re: Single filter instance with different searchers

Posted by Samarendra Pratap <sa...@gmail.com>.
Hi Erick, thanks for the reply.
Your answer has puzzled me even more, because what I observe is not what you
describe, or else I am failing to grasp your meaning. I have written a small
program that does exactly what my original question described: I create a
CachingWrapperFilter on one index and reuse it on other indexes. This single
filter gives me the expected results from each index. I would appreciate it
if you could throw some more light on this.

The output is given after the program ends.
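
In essence, the pattern under test is this (condensed from the full program
below):

Filter shared = new CachingWrapperFilter(
        new QueryWrapperFilter(qp.parse("docnumber:1")));
Searcher first = getSearcher(new int[]{0});     // first index only
Searcher others = getSearcher(new int[]{1, 2}); // the other two indexes
first.search(qp.parse("+content:common"), shared, 1000);
others.search(qp.parse("+content:common"), shared, 1000); // same instance reused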

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// the following program is compiled with Java 6

import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.util.*;

import java.util.*;

public class FilterTest
{
    protected Directory[] dirs;
    protected Analyzer a;
    protected Searcher[] searchers;
    protected QueryParser qp;
    protected Hashtable<String, Filter> filters;

    public FilterTest()
    {
        // create analyzer
        a = new StandardAnalyzer(Version.LUCENE_29);
        // create query parser
        qp = new QueryParser(Version.LUCENE_29, "content", a);
        // initialize the "filters" cache
        filters = new Hashtable<String, Filter>();
    }

    protected void createDirectories(int length)
    {
        // create the specified number of RAM directories
        dirs = new Directory[length];
        for (int i = 0; i < length; i++)
            dirs[i] = new RAMDirectory();
    }

    protected void createIndexes() throws Exception
    {
        /* Create an index in each directory. Each index contains two
           documents. Every document carries one id unique across all
           indexes ("id"), one id unique only within its own index
           ("docnumber") and one term common to all indexes ("content"). */
        for (int i = 0; i < dirs.length; i++)
        {
            IndexWriter iw = new IndexWriter(dirs[i], a, true,
                    IndexWriter.MaxFieldLength.LIMITED);

            Document d = new Document();
            // unique id across all indexes
            d.add(new Field("id", "" + (i * 2 + 1), Field.Store.YES,
                    Field.Index.NOT_ANALYZED, Field.TermVector.YES));
            // unique id within a single index
            d.add(new Field("docnumber", "1", Field.Store.YES,
                    Field.Index.NOT_ANALYZED, Field.TermVector.YES));
            // word common to all indexes
            d.add(new Field("content", "common", Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.YES));
            iw.addDocument(d);

            d = new Document();
            d.add(new Field("id", "" + (i * 2 + 2), Field.Store.YES,
                    Field.Index.NOT_ANALYZED, Field.TermVector.YES));
            d.add(new Field("docnumber", "2", Field.Store.YES,
                    Field.Index.NOT_ANALYZED, Field.TermVector.YES));
            d.add(new Field("content", "common", Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.YES));
            iw.addDocument(d);

            iw.close();
        }
    }

    protected void openSearchers() throws Exception
    {
        // open a searcher for every directory and keep it in an array
        searchers = new Searcher[dirs.length];
        for (int i = 0; i < dirs.length; i++)
            searchers[i] = new IndexSearcher(IndexReader.open(dirs[i], true));
    }

    protected Searcher getSearcher(int[] arr) throws Exception
    {
        // build a ParallelMultiSearcher over the searchers at the
        // index positions given in the argument
        Searcher[] s = new Searcher[arr.length];
        for (int i = 0; i < arr.length; i++)
            s[i] = this.searchers[arr[i]];

        return new ParallelMultiSearcher(s);
    }

    protected ScoreDoc[] search(String query, String filter, Searcher s) throws Exception
    {
        Filter f = null;
        if (filter != null)
        {
            if (filters.containsKey(filter))
            {
                System.out.println("Reusing filter for - " + filter);
                f = filters.get(filter);
            }
            else
            {
                System.out.println("Creating new filter for - " + filter);
                f = new CachingWrapperFilter(new QueryWrapperFilter(qp.parse(filter)));
                filters.put(filter, f);
            }
        }
        System.out.println("Query:(" + query + "), Filter:(" + filter + ")");
        return s.search(qp.parse(query), f, 1000).scoreDocs;
    }

    public static void main(String[] args) throws Exception
    {
        FilterTest ft = new FilterTest();
        ft.startTest();
    }

    public void startTest()
    {
        try
        {
            createDirectories(3);
            createIndexes();
            openSearchers();
            Searcher s;
            ScoreDoc[] sd;

            System.out.println("===================================");
            System.out.println("Fields of all the documents");
            // a searcher over all three indexes
            s = getSearcher(new int[]{0, 1, 2});
            // list every document with its ids
            sd = search("+content:common", null, s);
            for (int i = 0; i < sd.length; i++)
            {
                System.out.println("\tid:" + s.doc(sd[i].doc).get("id")
                        + ", docnumber:" + s.doc(sd[i].doc).get("docnumber"));
            }
            System.out.println("\n\n");

            System.out.println("===================================");
            System.out.println("Searching for documents in a single index. Filter will be created and cached");
            s = getSearcher(new int[]{0});
            sd = search("+content:common", "docnumber:1", s);
            System.out.println("Hits:" + sd.length);
            for (int i = 0; i < sd.length; i++)
            {
                System.out.println("\tid:" + s.doc(sd[i].doc).get("id")
                        + ", docnumber:" + s.doc(sd[i].doc).get("docnumber"));
            }
            System.out.println("\n\n");

            System.out.println("===================================");
            System.out.println("Searching for documents in indexes other than the previous search. Query and filter will be the same. Filter will be reused");
            s = getSearcher(new int[]{1, 2});
            sd = search("+content:common", "docnumber:1", s);
            System.out.println("Hits:" + sd.length);
            for (int i = 0; i < sd.length; i++)
            {
                System.out.println("\tid:" + s.doc(sd[i].doc).get("id")
                        + ", docnumber:" + s.doc(sd[i].doc).get("docnumber"));
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
OUTPUT:
[samar@myserver java]$ java FilterTest
===================================
Fields of all the documents
Query:(+content:common), Filter:(null)
        id:1, docnumber:1
        id:2, docnumber:2
        id:3, docnumber:1
        id:4, docnumber:2
        id:5, docnumber:1
        id:6, docnumber:2


===================================
Searching for documents in a single index. Filter will be created and cached
Creating new filter for - docnumber:1
Query:(+content:common), Filter:(docnumber:1)
Hits:1
        id:1, docnumber:1


===================================
Searching for documents in indexes other than the previous search. Query and filter will be the same. Filter will be reused
Reusing filter for - docnumber:1
Query:(+content:common), Filter:(docnumber:1)
Hits:2
        id:3, docnumber:1
        id:5, docnumber:1










On Wed, Nov 3, 2010 at 7:04 PM, Erick Erickson <er...@gmail.com> wrote:

> [Erick's reply snipped; it appears in full as the next message below.]



-- 
Regards,
Samar

Re: Single filter instance with different searchers

Posted by Erick Erickson <er...@gmail.com>.
I'm assuming you're down in Lucene land. Unless somehow you've gotten 63
separate filters when you think you only have one, I don't think what
you're doing will work. Or I'm failing to understand what you're doing at
all.

The problem is that I expect each of your indexes starts its document
numbering at 0. So your Filter is really a bit set keyed by Lucene document
ID.

So applying filter 2 to index 54 will NOT do what you want. What I suspect
you're seeing is that applying your filter is producing enough results from
index 54 (to continue my example) to fool you into thinking it's working.

Try running the query with and without the filter on each of your indexes,
perhaps, as a control, including a restrictive clause in the query that
does the same thing your filter does. Or construct the filter anew for
comparison.... If the numbers continue to be the same, I clearly don't
understand something! <G>....
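
Something along these lines would do as a control (untested; parser,
searcher, userQuery, experienceFilter and the "experience" field are names
I made up to match your example):

Query base = parser.parse(userQuery);
Query control = parser.parse(userQuery + " +experience:4"); // embeds the filter's constraint
TopDocs withFilter = searcher.search(base, experienceFilter, 1000);
TopDocs withClause = searcher.search(control, 1000);
System.out.println(withFilter.totalHits + " vs " + withClause.totalHits);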

Best
Erick

On Wed, Nov 3, 2010 at 6:05 AM, Samarendra Pratap <sa...@gmail.com> wrote:

> [Original question snipped; see the first message at the top of the thread.]