Posted to java-user@lucene.apache.org by Sirish Vadala <si...@gmail.com> on 2010/06/11 20:31:56 UTC

Problem using TopFieldCollector

I am currently on Lucene 2.2 and migrating to 2.9, with plans to eventually
move to 3.1.

In Lucene 2.2, I have a custom hit collector that both filters and sorts my
search results.

Let me describe the functionality. When a user includes advanced search
criteria with a text search, I first execute a query to fetch the advanced
search results. I then have a collection of advanced search result ids that
need to be matched against the text search result ids.

I implemented this successfully in 2.2 using the collector below:

----------------------------------------------------------------------

public class ResultHitCollector extends TopFieldDocCollector{

	private IndexSearcher searcher;
	private AnalysisTextSearchBean textSearchBean;

	public ResultHitCollector(IndexSearcher searcher, int numHits,
			AnalysisTextSearchBean textSearchBean, Sort sort)
			throws IOException {
		super(searcher.getIndexReader(), sort, numHits);
		this.searcher = searcher;
		this.textSearchBean = textSearchBean;
	}

	public void collect(int doc, float score) {
		try {
			HashMap<String, String> advanceSearchIds = textSearchBean
					.getAdvanceSearchIds();
			// No advanced search ids supplied: keep every text-search hit.
			if (advanceSearchIds == null) {
				super.collect(doc, score);
			} else {
				Document document = searcher.doc(doc);
				if (document != null) {
					String id = document.get(IFIELD_ID);
					if (advanceSearchIds.containsKey(id)) {
						super.collect(doc, score);
					}
				}
			}
		} catch (CorruptIndexException e) {
			System.err.println("ERROR: " + e.getMessage());
		} catch (IOException e) {
			System.err.println("ERROR: " + e.getMessage());
		}
	}
}

----------------------------------------------------------------------
The collector is called as below:


String[] sortOrder = {IFIELD_YEAR, IFIELD_TYPE, IFIELD_NUM, IFIELD_SESSION};
Sort sort = new Sort(sortOrder);
ResultHitCollector collector = new ResultHitCollector(indexSearcher, 100000,
analysisTextSearchBean, sort);
indexSearcher.search(booleanQuery, collector);
ScoreDoc[] hitScore = collector.topDocs().scoreDocs;
Utility.Log(Level.INFO, this, "Hits found: "+ hitScore.length);

----------------------------------------------------------------------

Since TopFieldDocCollector is deprecated in Lucene 2.9, the documentation
suggests using TopFieldCollector instead. I am not really sure how to get
this to work, as TopFieldCollector doesn't have any public constructors that
would let me extend it. Any hints on how to implement this would be highly
appreciated.

Using Collector in 2.9, I can achieve the filtering, but without the sort
functionality. I also need to sort the results after filtering the search.

Thanks.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-TopFieldCollector-tp889310p889310.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Problem using TopFieldCollector

Posted by Sirish Vadala <si...@gmail.com>.

I was able to get this whole thing to work using the delegation pattern. In
my custom collector object, I internally delegate to the TopFieldCollector
after doing my custom processing.
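
The delegation approach described above might look roughly like the sketch
below. It is not the code from this thread: HitSink, RecordingSink and
FilteringSink are hypothetical stand-ins so that the example runs without
Lucene on the classpath, and the filter is keyed on global docid purely for
illustration (the collector in this thread filters on a stored id field). In
real 2.9 code the wrapper would presumably extend Collector, obtain its
delegate from TopFieldCollector.create(...), and forward setScorer,
setNextReader and acceptsDocsOutOfOrder as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the parts of a Lucene 2.9 Collector used here. */
interface HitSink {
    void setNextSegment(int docBase); // plays the role of setNextReader(reader, docBase)
    void collect(int doc);            // doc is relative to the current segment
}

/** Hypothetical stand-in for the delegate (TopFieldCollector in real code). */
class RecordingSink implements HitSink {
    private int docBase;
    final List<Integer> globalDocs = new ArrayList<Integer>();

    public void setNextSegment(int docBase) { this.docBase = docBase; }
    public void collect(int doc) { globalDocs.add(docBase + doc); }
}

/** Wrapper: filters each hit, then forwards accepted hits to the delegate. */
class FilteringSink implements HitSink {
    private final HitSink delegate;
    private final Set<Integer> allowedGlobalDocs; // null means "no advanced filter"
    private int docBase;

    FilteringSink(HitSink delegate, Set<Integer> allowedGlobalDocs) {
        this.delegate = delegate;
        this.allowedGlobalDocs = allowedGlobalDocs;
    }

    public void setNextSegment(int docBase) {
        this.docBase = docBase;
        delegate.setNextSegment(docBase); // the delegate must see every segment change
    }

    public void collect(int doc) {
        // Custom processing first; only accepted hits reach the delegate.
        if (allowedGlobalDocs == null || allowedGlobalDocs.contains(docBase + doc)) {
            delegate.collect(doc);
        }
    }
}

class DelegationDemo {
    static List<Integer> run() {
        RecordingSink top = new RecordingSink();
        Set<Integer> allowed = new HashSet<Integer>(Arrays.asList(1, 5));
        HitSink collector = new FilteringSink(top, allowed);
        collector.setNextSegment(0); // first segment holds global docs 0..3
        collector.collect(1);        // global doc 1, allowed
        collector.collect(2);        // global doc 2, filtered out
        collector.setNextSegment(4); // second segment starts at global doc 4
        collector.collect(1);        // global doc 5, allowed
        return top.globalDocs;       // results are read from the delegate
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the results live in the delegate, the caller keeps a reference to it
and reads the hits from there rather than from the wrapper.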

Thanks.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-TopFieldCollector-tp889310p897633.html

Re: Problem using TopFieldCollector

Posted by Sirish Vadala <si...@gmail.com>.

Thanks for the response.

Yeah, eventually I chose to extend the Collector class, since none of the
other collectors, viz. TopFieldCollector and TopDocsCollector, allow me to
extend them and override their methods.

I could not grasp exactly what the below means:

Rebecca Watson wrote:
> 
> i keep a copy of the current indexreader + docBase, and used the
> indexreader.document method to get the doc/field values required in the
> collect method.
> 
> note that the docBase is used to keep/get the global docid but the doc
> passed to the .collect method relates to the current indexreader. i.e.
> global-docid = docBase + docid
> 

The documentation says:

NOTE: The doc that is passed to the collect method is relative to the
current reader. If your collector needs to resolve this to the docID space
of the Multi*Reader, you must re-base it by recording the docBase from the
most recent setNextReader call.

In my current collect method implementation, I use searcher.doc(doc) as
shown below:

	public void collect(int doc) throws IOException {
		try {
                        ... ... ... ... ...
			Document document = searcher.doc(doc);
                        ... ... ... ... ...
		} catch (CorruptIndexException e) {
			System.err.println("ERROR: " + e.getMessage());
		} catch (IOException e) {
			System.err.println("ERROR: " + e.getMessage());
		}
	}

Does this mean I am getting a wrong document, since I am not adding the
docBase?
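
To illustrate the re-basing that the quoted documentation describes, here is a
small sketch in plain Java with no Lucene dependency. The SegmentedIndex class
and its two made-up segments are assumptions for illustration only. It
suggests that, on an index with more than one segment, looking up a
segment-relative id as if it were global can return a different document than
the one that matched, while on a single-segment index the two ids coincide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a multi-segment index; not a Lucene class. */
class SegmentedIndex {
    private final List<String[]> segments = new ArrayList<String[]>();

    void addSegment(String... docs) { segments.add(docs); }

    /** docBase of a segment = number of documents in all earlier segments. */
    int docBase(int segment) {
        int base = 0;
        for (int i = 0; i < segment; i++) base += segments.get(i).length;
        return base;
    }

    /** Lookup by global id, analogous in spirit to searcher.doc(globalDoc). */
    String doc(int globalDoc) {
        int remaining = globalDoc;
        for (String[] seg : segments) {
            if (remaining < seg.length) return seg[remaining];
            remaining -= seg.length;
        }
        throw new IllegalArgumentException("no such doc: " + globalDoc);
    }
}

class DocBaseDemo {
    static SegmentedIndex sample() {
        SegmentedIndex index = new SegmentedIndex();
        index.addSegment("a0", "a1", "a2"); // global docs 0..2
        index.addSegment("b0", "b1");       // global docs 3..4
        return index;
    }

    public static void main(String[] args) {
        SegmentedIndex index = sample();
        int docBase = index.docBase(1); // documents in earlier segments
        int doc = 1;                    // segment-relative id of "b1"
        System.out.println(index.doc(doc));           // lookup without re-basing
        System.out.println(index.doc(docBase + doc)); // lookup with re-basing
    }
}
```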

I will do some research and use a hitqueue to sort the records in the
collector.

Also, I haven't done any profiling to see the difference in cost due to the
field loads. I will profile the test case and check the difference. But most
likely I will still need to filter the records per user requirements, which
eventually leads to field loads.

One more discovery though :)
Now I cannot use collector.topDocs().scoreDocs if I extend Collector. It looks
like I have to save the records in another variable in the collector and
retrieve them from there.

Changing the entire implementation and specification makes it difficult to
migrate.

Thanks anyway.

Rebecca Watson wrote:
> 
> you sort in the collector anyway? -- using a custom hitqueue? in which
> case you'd use the global docid in the hitqueue / any filters created
> through the collector.
> 
> also, we ended up having to re-engineer our system so that we didn't
> use field-loads as this was a bottleneck in our system... maybe you
> should think about merging the two advanced/text documents together even
> if this duplicates information in your index so you don't have to do
> field loads...
> not sure if you've profiled to see the difference in cost?
> 
> hope that helps,
> 
> bec :)
> 

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-TopFieldCollector-tp889310p895092.html

Re: Problem using TopFieldCollector

Posted by Rebecca Watson <be...@gmail.com>.

sorry, also forgot to say -

the abstract Collector.setScorer method gives you the new
Scorer (for the new indexreader i.e. index) so keep a pointer to that too, as it
looks like you need to use it in your .collect method (the new Collector.collect
method is only given the docid now).

the javadocs discuss global docid/docid for current index in the example:
http://lucene.apache.org/java/2_9_0/api/all/index.html

bec :)

Re: Problem using TopFieldCollector

Posted by Rebecca Watson <be...@gmail.com>.

hi,

i had similar issues migrating to using the new collectors... we use a custom
hitcollector too where we accessed document fields to aid in scoring docs.

when migrating - i chose to extend the Collector class, where the
.collect method is still overridden pretty much as before.

in the new abstract method setNextReader:
public void setNextReader(IndexReader reader, int docBase) {
     ...
}

i keep a copy of the current indexreader + docBase, and used the
indexreader.document method to get the doc/field values required in the
collect method.

note that the docBase is used to keep/get the global docid but the doc passed
to the .collect method relates to the current indexreader. i.e.
global-docid = docBase + docid

you sort in the collector anyway? -- using a custom hitqueue? in which
case you'd use the global docid in the hitqueue / any filters created
through the collector.
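
A custom hit queue of the kind mentioned here could be sketched with
java.util.PriorityQueue. This is an illustration under stated assumptions
rather than Lucene's own queue classes: Hit, TopHits and the single integer
sort key are hypothetical. It keeps only the best n hits and stores the global
docid (docBase + doc) so that entries from different readers stay
distinguishable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical hit record: a global docid plus one sort key (e.g. a year). */
class Hit {
    final int globalDoc;
    final int sortKey;
    Hit(int globalDoc, int sortKey) { this.globalDoc = globalDoc; this.sortKey = sortKey; }
}

/** Keeps the n hits with the smallest sort key; ties broken by global docid. */
class TopHits {
    private static final Comparator<Hit> ORDER = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            if (a.sortKey != b.sortKey) return a.sortKey < b.sortKey ? -1 : 1;
            return a.globalDoc < b.globalDoc ? -1 : (a.globalDoc == b.globalDoc ? 0 : 1);
        }
    };

    private final int n; // must be at least 1
    // Reversed order puts the worst retained hit at the head, ready for eviction.
    private final PriorityQueue<Hit> queue;
    private int docBase;

    TopHits(int n) {
        this.n = n;
        this.queue = new PriorityQueue<Hit>(n, Collections.reverseOrder(ORDER));
    }

    /** Call whenever the search moves on to the next reader. */
    void setDocBase(int docBase) { this.docBase = docBase; }

    void collect(int doc, int sortKey) {
        queue.add(new Hit(docBase + doc, sortKey)); // store the global docid
        if (queue.size() > n) queue.poll();         // evict the current worst
    }

    /** Returns the retained global docids, best first. */
    List<Integer> topDocs() {
        List<Hit> hits = new ArrayList<Hit>(queue);
        Collections.sort(hits, ORDER);
        List<Integer> docs = new ArrayList<Integer>();
        for (Hit h : hits) docs.add(h.globalDoc);
        return docs;
    }
}

class TopHitsDemo {
    static List<Integer> run() {
        TopHits top = new TopHits(2);
        top.setDocBase(0);
        top.collect(0, 2005); // global doc 0
        top.collect(1, 1999); // global doc 1
        top.setDocBase(2);
        top.collect(0, 2001); // global doc 2
        top.collect(1, 2010); // global doc 3
        return top.topDocs();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Holding the queue inside the collector also gives the caller a place to read
the sorted hits from once the search has finished.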

also, we ended up having to re-engineer our system so that we didn't
use field-loads as this was a bottleneck in our system... maybe you should
think about merging the two advanced/text documents together even if this
duplicates information in your index so you don't have to do field loads...
not sure if you've profiled to see the difference in cost?

hope that helps,

bec :)
