
Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2007/05/30 21:04:54 UTC

restricting hits to a subset of "id"s

I'm relatively new to Lucene-- I think I know the answer to this, but in 
case I'm wrong I'd like to know.

In our application we are looking for hits in a large corpus of resumes.
The documents are rather simple: a person id and a text field. The query
is something like a "more like this" application. This is working very
well right out of the box. However, my user would like to input a subset
of all the person ids and only return those hits that are among that list.
This input list is likely to be many thousands of people. The people in
this list won't fall into any obvious categories that could have been
handled by an appropriate query to the index, had the people been tagged
at indexing time. It will be essentially a "random" list of person ids.

My guess is that my user would really like the scoring to be done
considering only that subset of person ids as well, but we haven't
explicitly discussed it, and I'm pretty sure that the scoring is based on
information in the entire index and can't be changed on the fly, correct?

In any case, it seems to me that the "natural" way to return only people
who are in the original input list is to simply use Lucene as it is,
getting all the hits I need, and then returning from the application only
those on the original input list. Does this seem appropriate?
Thanks in advance for any pointers--

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com

Re: restricting hits to a subset of "id"s

Posted by Donna L Gresh <gr...@us.ibm.com>.

Thanks Yonik, this is working well (the BitSet and Filter option). It's
always helpful to have a pointer as to where to start--

>It's probably easier to use a Filter (which essentially does the same
>thing at a lower level in the search API).

>Use termDocs(Term) to look up the ids, add them to a BitSet, and make
>a Filter with that.
>You might want to check out CachingWrapperFilter or QueryFilter too.


Re: restricting hits to a subset of "id"s

Posted by Yonik Seeley <yo...@apache.org>.

On 5/30/07, Donna L Gresh <gr...@us.ibm.com> wrote:
> My guess is that my user would really like the scoring to be done only
> considering that subset of person ids as
> well but we haven't explicitly discussed it and I'm pretty sure that the
> scoring is based on information in
> the entire index and can't be changed on the fly, correct?

Yes, each term's document frequency (used for idf, the inverse document
frequency) will be based on the whole index.
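
To see why this matters for a subset, here is a minimal sketch. It assumes
the idf formula of classic Lucene's DefaultSimilarity,
1 + ln(numDocs / (docFreq + 1)); the class name IdfSketch and all of the
counts are made up for illustration and do not come from the resume index
discussed in this thread.

```java
final class IdfSketch {

    // Assumed formula (classic DefaultSimilarity style):
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts: a term found in 5,000 of 1,000,000 resumes
        // overall, but in 900 of the 2,000 resumes in the user's subset.
        double wholeIndex = idf(5_000, 1_000_000);
        double subsetOnly = idf(900, 2_000);

        System.out.printf("idf over the whole index: %.3f%n", wholeIndex);
        System.out.printf("idf over the subset only: %.3f%n", subsetOnly);
    }
}
```

With counts like these, a term that is rare across the whole corpus but
common inside the chosen subset gets a much higher idf from the whole index
than it would if only the subset were counted, so the ranking within the
subset can differ from a ranking computed on the subset alone.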

> In any case it seems to me that the "natural" way to only return people
> who are in the original input list
> is to simply use Lucene as it is, getting all the hits I need, and then
> only returning out of the application those on
> the original input list. Does this seem appropriate?
> Thanks in advance for any pointers--

It's probably easier to use a Filter (which essentially does the same
thing at a lower level in the search API).

Use termDocs(Term) to look up the ids, add them to a BitSet, and make
a Filter with that.
You might want to check out CachingWrapperFilter or QueryFilter too.

-Yonik
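
The Filter suggestion above can be sketched without Lucene on the
classpath. The following is a plain-JDK stand-in, not Lucene's API: a Map
plays the role of the id lookup that IndexReader.termDocs(Term) would do,
and an array of ranked document numbers plays the role of the hits. The
class and method names (IdSubsetFilterSketch, allowedDocs, filterHits) and
the sample ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IdSubsetFilterSketch {

    // Builds a BitSet with one bit set per wanted person id that exists in
    // the (simulated) index. In Lucene the doc numbers would come from
    // termDocs(Term) instead of a Map.
    static BitSet allowedDocs(Map<String, Integer> docNumberById,
                              Iterable<String> wantedIds,
                              int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String id : wantedIds) {
            Integer doc = docNumberById.get(id);
            if (doc != null) {
                bits.set(doc); // ids missing from the index are skipped
            }
        }
        return bits;
    }

    // Keeps only the hits whose bit is set, preserving the ranked order.
    static List<Integer> filterHits(int[] rankedDocs, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : rankedDocs) {
            if (allowed.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical index: person id -> internal document number.
        Map<String, Integer> docNumberById = new HashMap<>();
        docNumberById.put("p100", 0);
        docNumberById.put("p200", 1);
        docNumberById.put("p300", 2);
        docNumberById.put("p400", 3);

        // The user's "random" list of ids; "p999" is not in the index.
        List<String> wanted = Arrays.asList("p400", "p100", "p999");
        BitSet allowed = allowedDocs(docNumberById, wanted, 4);

        // Hypothetical ranked hits (best first) from the unrestricted query.
        int[] rankedDocs = {2, 3, 1, 0};
        System.out.println("kept hits: " + filterHits(rankedDocs, allowed));
    }
}
```

In a Lucene application of this era, as I understand it, the same BitSet
would be returned from a Filter subclass's bits(IndexReader) method and the
filter passed along with the query, so that documents outside the list are
dropped during the search rather than afterwards. Scores for the surviving
hits would still be computed from whole-index statistics.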
