Posted to java-user@lucene.apache.org by Sirish Vadala <si...@gmail.com> on 2010/04/13 22:59:13 UTC

Problem with search

Hello All,

I am fairly new to Lucene and am having a problem filtering search results.

Background:

My index contains multiple bills, and each bill has multiple versions.

Each version of the same bill has a different bill Version Id, but the same
bill Id. In most cases, the text varies only slightly between versions. The
text of every version is indexed.

Problem:

Let's say a particular search term is present in one version of a bill; in
most cases it is present in all the other versions too. So the users have
asked to see only the latest version of each bill that contains the search
term.

So when I perform a search for a particular word, I might get different
versions of the same bill, but I have to display only the latest record for
that bill. I did some research and understand that filters could be used to
implement this kind of requirement; however, I am not sure how to proceed.

Any hints on how to implement this would be highly appreciated.

Thanks.
-- 
View this message in context: http://n3.nabble.com/Problem-with-search-tp717137p717137.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Problem with search

Posted by Sirish Vadala <si...@gmail.com>.
Hmmm... Seems like a lot of work to be done. I will try these options and
update.

Thanks a lot.

Best.


Re: Problem with search

Posted by Shai Erera <se...@gmail.com>.
I don't know if that proposal is the most efficient one, but you can try it.
In general, what you're looking for is a GROUP BY Bill-Id feature that then
selects the most recent version, right? Since you don't need all the versions
of the same bill, you only have to hold the most recent Version-Id per bill.
What you can do is write a Collector which, for each received document, checks
its Bill-Id and Version-Id. It keeps a Map of Bill-Id -> Version-Id and, for
every incoming doc, checks the map:
1) If the Bill-Id hasn't been seen yet, it stores the doc's Version-Id in the
map.
2) If it has been seen, it compares the Version-Id of the incoming doc to the
one in the map and replaces the stored one if the incoming one is more recent.
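
The two steps above can be sketched roughly as follows. This is a standalone
illustration of the map logic only, not wired into Lucene's Collector API.
The names Hit and LatestVersionPerBill are hypothetical, and it assumes the
Version-Id is numeric and that a larger value means a more recent version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One matching document (hypothetical holder, not a Lucene type). */
final class Hit {
    final int docId;
    final String billId;
    final int versionId; // assumption: larger value = more recent version
    final float score;

    Hit(int docId, String billId, int versionId, float score) {
        this.docId = docId;
        this.billId = billId;
        this.versionId = versionId;
        this.score = score;
    }
}

/** Keeps, for each Bill-Id, only the hit with the highest Version-Id. */
final class LatestVersionPerBill {
    private final Map<String, Hit> latest = new HashMap<>();

    /** Call once per matching document, as a Collector's collect() would. */
    void collect(Hit hit) {
        Hit seen = latest.get(hit.billId);
        // Step 1: Bill-Id not seen yet. Step 2: incoming version is newer.
        if (seen == null || hit.versionId > seen.versionId) {
            latest.put(hit.billId, hit);
        }
    }

    /** Returns one hit per bill, highest score first. */
    List<Hit> results() {
        List<Hit> out = new ArrayList<>(latest.values());
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

In a real Collector, the Bill-Id and Version-Id for each doc would have to be
looked up from the index rather than passed in as they are here.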

By storing the Bill-Id and Version-Id in the FieldCache you can make that
Collector work very fast. Also, you can optimize the process by, e.g., not
checking the map if the document has no chance of being selected for the
top-K requested docs (e.g. because of a low score).

I've outlined a general approach; other, perhaps more efficient, ones may
exist.

Another alternative is to run your search collecting the top N*K docs, where
K is the number of results you want and N is a multiplier you apply to K.
After the search is done, you filter out the unneeded docs w/ "old"
Version-Id. If you choose your N smartly, you'll do it just once and won't
have to re-run the query because it filtered out too many docs.
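
A rough sketch of that post-filtering step, in plain Java with no Lucene
dependency. RankedHit and TopKLatest are hypothetical names, the input list is
assumed to be the top N*K hits already sorted by score, and the Version-Id is
assumed to be numeric with larger meaning newer. Note that "latest" here means
the latest among the fetched hits only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One entry of the over-fetched result list (hypothetical holder type). */
final class RankedHit {
    final String billId;
    final int versionId; // assumption: larger value = more recent version
    final float score;

    RankedHit(String billId, int versionId, float score) {
        this.billId = billId;
        this.versionId = versionId;
        this.score = score;
    }
}

final class TopKLatest {
    /**
     * Keeps at most k hits, in their original rank order, dropping every hit
     * whose bill also appears in the list with a higher Version-Id.
     */
    static List<RankedHit> filter(List<RankedHit> ranked, int k) {
        // First pass: highest Version-Id seen for each Bill-Id.
        Map<String, Integer> newest = new HashMap<>();
        for (RankedHit h : ranked) {
            newest.merge(h.billId, h.versionId, Integer::max);
        }
        // Second pass: keep only the newest version of each bill, up to k.
        List<RankedHit> out = new ArrayList<>();
        for (RankedHit h : ranked) {
            if (out.size() == k) {
                break;
            }
            if (h.versionId == newest.get(h.billId)) {
                out.add(h);
            }
        }
        return out;
    }
}
```

If the returned list has fewer than k entries while the input was a full N*K
hits, N was too small and the query would need to be re-run with a larger
multiplier.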

Hope this helps,
Shai
