Posted to java-user@lucene.apache.org by Sven Schmeier <sv...@schmeier.com> on 2006/01/14 17:06:22 UTC

Part-Of Match

Hi Folks,
 
I have the following problem:
We have a very large list of special words or phrases that should match if
they occur in a document. The idea was to fill the index with all these
phrases and use the document as the query. We would then expect a 100% match
(score 1) for phrases that occur exactly in the document; for all others the
score should be less than 1.
 
Example: 
 
Query document: 
 von Willebrand factor (vWF) is a large multimeric glycoprotein synthesized
exclusively by endothelial cells and megakaryocytes (1). 
 
 
Index contains the following phrases:
 
(1) von Willebrand
(2) glycoprotein
(3) endothelial glycoprotein
(4) multimeric megakaryocytes
 
 
So the result should be:
(1) Score: 1
(2) Score: 1
(3) Score: less than 1
(4) Score: less than 1
 
Is there any way of doing this with Lucene?
 
Thanks and best wishes,
Sven

Re: AW: Part-Of Match

Posted by Chris Hostetter <ho...@fucit.org>.

: >>von Willebrand<< is not the query but a document in the index.... The task
: is to detect exact matches of phrases inside a query (large document) with
: these phrases stored in the index.

Lemme see if i can restate your problem...

You want to build a data repository in which you insert a large number
of "concepts", where a concept is a short phrase consisting of a few words
(possibly just one word).  The words in any given concept phrase may
overlap with (or be a superset of) the words in other concepts.

Once this concept repository is built, you want to build a black box
around it, such that people can hand your black box a "document"
(ie: a research paper, a newspaper article, a short story, ...
some text consisting of many many sentences) and you want your black box
to then return the list of concepts that match the input document, such
that the concepts with the highest score are concepts whose phrase appears
exactly in the input document.  Concepts whose phrase doesn't appear
exactly in the document should still be returned, but with a lower score
based on how many words in the concept's phrase are found in the input
document.

	(have i adequately described your problem?)

It's an interesting idea.  can it be done with lucene? ... i can think of
one kludgy mechanism for doing it but i'd be very surprised if there isn't
a better way (or if there is some other software library out there that
would be more suited)

Build a permanent index in which each concept is a Lucene Document.
These documents really only need one stored/tokenized/indexed field
containing the phrase (if you want other payload fields that's up to you).

Each time you are asked to analyze a text sample and return matching
phrases, run the text through your analyzer to get back a TokenStream, and
for each of those tokens, use a TermDocs iterator to find out if any
phrase in your concept index contains that term, and if so which ones.
(you could also do this by building a boolean OR query out of all the
words in your input document -- but that may run into performance
limitations if your input docs are too big, and it will try to score each
concept, which isn't necessary, so even for short input text it's less
efficient).
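
To make the candidate-lookup step concrete, here is a minimal sketch in
plain Java. It does not use Lucene: the class ConceptCandidates, its
methods, and the simple lowercase/split tokenizer are all hypothetical
stand-ins for the concept index, the TermDocs iteration and the analyzer.
It only shows the idea of collecting every concept that shares at least one
term with the input text, without scoring anything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the concept index; not part of Lucene.
final class ConceptCandidates {
    private final List<String> concepts = new ArrayList<>();
    // term -> ids of the concepts containing it (a tiny inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Assumed analyzer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Adds one concept phrase and returns its id.
    int add(String phrase) {
        int id = concepts.size();
        concepts.add(phrase);
        for (String term : tokenize(phrase)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Returns every concept sharing at least one term with the text, in id order.
    Set<String> candidates(String text) {
        Set<Integer> ids = new TreeSet<>();
        for (String term : new HashSet<>(tokenize(text))) {
            Set<Integer> hits = postings.get(term);
            if (hits != null) {
                ids.addAll(hits);
            }
        }
        Set<String> result = new LinkedHashSet<>();
        for (int id : ids) {
            result.add(concepts.get(id));
        }
        return result;
    }
}
```

Each distinct token of the input text is looked up once, so the cost grows
with the vocabulary of the input rather than with the number of concepts.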

Now you have an (unordered) list of concepts that have something to do
with your input text.

Next build a RAMDirectory based index consisting of exactly one document
which you build from the input text.  Loop over that list of concepts you
got, and build a boolean query out of each one along the lines that
Daniel described: a phrase query on the whole concept phrase along with
term queries for each individual word -- all optional.  Run each of these
boolean queries against your one-document RAMDirectory.  The higher the
score, the better that concept applies to your input text.
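
The scoring step can be imitated in plain Java to see how the numbers
behave. This is only a sketch under stated assumptions: it does not use
Lucene and does not reproduce Lucene's scoring formula. The class
ConceptScorer, the phraseWeight parameter and the tokenizer are
hypothetical. The phrase clause and the per-word clauses are all optional,
and the sum of the clauses that hit is divided by the maximum possible sum,
so an exact phrase match scores 1 and anything else scores less.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for "phrase query + term queries" run against one document.
final class ConceptScorer {
    // Assumed analyzer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns a value in [0, 1]; 1 only when the concept occurs as a contiguous phrase.
    static double score(String concept, String document, double phraseWeight) {
        List<String> conceptTerms = tokenize(concept);
        List<String> docTerms = tokenize(document);
        if (conceptTerms.isEmpty()) {
            return 0.0;
        }
        // Term clauses: one point per concept word found anywhere in the document.
        Set<String> docVocabulary = new HashSet<>(docTerms);
        int matched = 0;
        for (String term : conceptTerms) {
            if (docVocabulary.contains(term)) {
                matched++;
            }
        }
        // Phrase clause: the concept's tokens appear contiguously and in order.
        boolean phraseHit = Collections.indexOfSubList(docTerms, conceptTerms) >= 0;
        double achieved = (phraseHit ? phraseWeight : 0.0) + matched;
        return achieved / (phraseWeight + conceptTerms.size());
    }
}
```

With Sven's example document and a phraseWeight of 10, "von Willebrand" and
"glycoprotein" score 1, while "endothelial glycoprotein" and "multimeric
megakaryocytes" have both words present but not adjacent, so they score
below 1.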


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

AW: Part-Of Match

Posted by Sven <sv...@schmeier.de>.

Hi Daniel,

>>von Willebrand<< is not the query but a document in the index.... The task
is to detect exact matches of phrases inside a query (large document) with
these phrases stored in the index.

Cheers,
Sven



> -----Original Message-----
> From: Daniel Naber [mailto:lucenelist2005@danielnaber.de] 
> Sent: Sunday, 15 January 2006 14:01
> To: java-user@lucene.apache.org
> Subject: Re: Part-Of Match
> 
> 
> On Saturday, 14 January 2006 17:06, Sven Schmeier wrote:
> 
> > (1) von Willebrand
> 
> You can rewrite the query like this:
> 
> "von Willebrand"^10 OR (von AND Willebrand)
> 
> Documents with the phrase "von Willebrand" will then score 
> higher than 
> those where "von" and "Willebrand" appear at different places.
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de
> 


Re: Part-Of Match

Posted by Daniel Naber <lu...@danielnaber.de>.

On Saturday, 14 January 2006 17:06, Sven Schmeier wrote:

> (1) von Willebrand

You can rewrite the query like this:

"von Willebrand"^10 OR (von AND Willebrand)

Documents with the phrase "von Willebrand" will then score higher than 
those where "von" and "Willebrand" appear at different places.
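
If there are many phrases, the rewritten query string can be generated
rather than typed. The following is a minimal sketch in plain Java; the
class PhraseQueryRewriter and its rewrite method are hypothetical helpers,
not part of Lucene, and the sketch assumes the phrase contains no
characters that are special in the query syntax (it does no escaping).

```java
// Hypothetical helper; builds the query string only, it does not call Lucene.
final class PhraseQueryRewriter {
    // Turns a phrase into: "w1 w2 ..."^boost OR (w1 AND w2 AND ...)
    static String rewrite(String phrase, int boost) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("phrase must not be empty");
        }
        String[] words = trimmed.split("\\s+");
        if (words.length == 1) {
            // A single word needs neither a phrase clause nor an AND group.
            return words[0];
        }
        String exact = "\"" + String.join(" ", words) + "\"^" + boost;
        String allWords = "(" + String.join(" AND ", words) + ")";
        return exact + " OR " + allWords;
    }
}
```

For the phrase von Willebrand and a boost of 10 this produces the query
string shown above.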

Regards
 Daniel

-- 
http://www.danielnaber.de