You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Jose Luna <jl...@usbmis.com> on 2007/12/11 19:30:09 UTC

Advice regarding fuzzy phrase searching

Hello,

I am looking for some advice regarding which tools I might use to solve 
my problem.  I apologize ahead of time for the long explanation.

Problem Description:  I would like to index a set of very large HTML 
documents.  I would then be able to run two different kinds of queries: 
proximity queries, and fuzzy phrase queries.   I would like to get the 
exact positions of the matching results from the query (I need to modify 
the original documents at these positions.)  I will only need to search 
one document at a time, i.e., I already know which document I'll be 
looking in, so what's important is finding the positions of the hits 
within that document.

For example,  for a fuzzy search, I may want to search for "arterial 
oxygen saturation".   I would want this to match "arterial oxygen 
saturate", and I would want to get the position of where it matches.  I 
would also like to do proximity searches, with these broken into 
separate terms.  So, I may be searching for "arterial", "oxygen", and 
"saturate" all within 10 terms of each other, and get the positions of 
the cases that match.

To the best of my understanding, Lucene is not a good choice to solve 
this problem (please correct me if I'm wrong).   As far as I can tell, 
Lucene breaks up a document into a set of terms, and indexes these in 
some sort of structure.  My guess is a B+ tree, but I'm curious to learn 
more about it -- I couldn't find much in the documentation about the 
underlying index structure.   Anyway, this means that the keys->pointer 
pairs in the index are basically term->documenID pairs.  So this isn't 
very suitable for my problem. I already know which document I want to 
search, I'm interested in the position of hits.    If I were to search 
for the phrase "arterial oxygen saturation", this would be broken into 
terms and I could iterate through all of the TermPositions for a given 
term in the document, and try to find out where these terms are adjacent 
in the document.  Considering that my document is very large, the 
phrases can be 10+ terms, and I need to do this hundreds of times, this 
doesn't sound like a very good solution.  If we introduce the idea of 
fuzzy matches and proximity searches, it seems like this task of 
iterating through TermPositions becomes very complicated. 

I've spent time reading the docs, creating a test program, and reading 
the mailing list.  As far as I can tell, Lucene is geared towards 
document based queries, and isn't the ideal tool for my problem.  I 
think an index based on a suffix tree (or variation of) would better 
meet my needs, but I'm not sure how well these perform with fuzzy and 
proximity searches.  I've looked around, and I can't seem to find a good 
opensource indexing framework like lucene that's based on a suffix 
tree.  Are there any suggestions for tools that would help with this 
problem?  Does anyone have any suggestions on how I might bend Lucene to 
meet my needs?

Thanks in advance,

JLuna

Re: Advice regarding fuzzy phrase searching

Posted by Ruslan Sivak <rs...@istandfor.com>.

Look into SpanNearQuery.  It has a slop which lets you say how close you 
want the terms to be.  For a single document, if you are going to be 
doing a lot of these searches, I recommend using a MemoryIndex.

Russ

Jose Luna wrote:
> Hello,
>
> I am looking for some advice regarding which tools I might use to 
> solve my problem.  I apologize ahead of time for the long explanation.
>
> Problem Description:  I would like to index a set of very large HTML 
> documents.  I would then be able to run two different kinds of 
> queries: proximity queries, and fuzzy phrase queries.   I would like 
> to get the exact positions of the matching results from the query (I 
> need to modify the original documents at these positions.)  I will 
> only need to search one document at a time, i.e., I already know which 
> document I'll be looking in, so what's important is finding the 
> positions of the hits within that document.
>
> For example,  for a fuzzy search, I may want to search for "arterial 
> oxygen saturation".   I would want this to match "arterial oxygen 
> saturate", and I would want to get the position of where it matches.  
> I would also like to do proximity searches, with these broken into 
> separate terms.  So, I may be searching for "arterial", "oxygen", and 
> "saturate" all within 10 terms of each other, and get the positions of 
> the cases that match.
>
> To the best of my understanding, Lucene is not a good choice to solve 
> this problem (please correct me if I'm wrong).   As far as I can tell, 
> Lucene breaks up a document into a set of terms, and indexes these in 
> some sort of structure.  My guess is a B+ tree, but I'm curious to 
> learn more about it -- I couldn't find much in the documentation about 
> the underlying index structure.   Anyway, this means that the 
> keys->pointer pairs in the index are basically term->documenID pairs.  
> So this isn't very suitable for my problem. I already know which 
> document I want to search, I'm interested in the position of hits.    
> If I were to search for the phrase "arterial oxygen saturation", this 
> would be broken into terms and I could iterate through all of the 
> TermPositions for a given term in the document, and try to find out 
> where these terms are adjacent in the document.  Considering that my 
> document is very large, the phrases can be 10+ terms, and I need to do 
> this hundreds of times, this doesn't sound like a very good solution.  
> If we introduce the idea of fuzzy matches and proximity searches, it 
> seems like this task of iterating through TermPositions becomes very 
> complicated.
> I've spent time reading the docs, creating a test program, and reading 
> the mailing list.  As far as I can tell, Lucene is geared towards 
> document based queries, and isn't the ideal tool for my problem.  I 
> think an index based on a suffix tree (or variation of) would better 
> meet my needs, but I'm not sure how well these perform with fuzzy and 
> proximity searches.  I've looked around, and I can't seem to find a 
> good opensource indexing framework like lucene that's based on a 
> suffix tree.  Are there any suggestions for tools that would help with 
> this problem?  Does anyone have any suggestions on how I might bend 
> Lucene to meet my needs?
>
> Thanks in advance,
>
> JLuna
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Advice regarding fuzzy phrase searching

Posted by Jose Luna <jl...@usbmis.com>.

Mark, Russ, thanks for the replies.

Mark, this looks great, I think it's exactly what I was looking for.  I 
think this should definitely be added to Lucene when it is stable 
enough.  I suspect there are others that would find it useful.

JLuna

Mark Miller wrote:
> Take a look at: https://issues.apache.org/jira/browse/LUCENE-794
>
> This is an extension to the Highlighter that highlights span and 
> proximity queries. If you rewrite the query it will also do fuzzy 
> queries. I am sure you can easily steal some of the code to do what 
> you want.
>
> Keep in mind, because of how Lucene's SpanQuery works, if you say to 
> find 'mark within 4 of ball', Lucene will not find all occurrences. 
> ie: 'mark close to ball ball' -- even if you say find mark within 20 
> of ball, a Span query will only find the first occurrence of ball even 
> though both occurrences are within 20. If ball was on both sides of 
> mark, both would match, but after finding the first ball with 20 of 
> mark, Span doesnt continue looking for another.
>
> - Mark
>
> Jose Luna wrote:
>> Hello,
>>
>> I am looking for some advice regarding which tools I might use to 
>> solve my problem.  I apologize ahead of time for the long explanation.
>>
>> Problem Description:  I would like to index a set of very large HTML 
>> documents.  I would then be able to run two different kinds of 
>> queries: proximity queries, and fuzzy phrase queries.   I would like 
>> to get the exact positions of the matching results from the query (I 
>> need to modify the original documents at these positions.)  I will 
>> only need to search one document at a time, i.e., I already know 
>> which document I'll be looking in, so what's important is finding the 
>> positions of the hits within that document.
>>
>> For example,  for a fuzzy search, I may want to search for "arterial 
>> oxygen saturation".   I would want this to match "arterial oxygen 
>> saturate", and I would want to get the position of where it matches.  
>> I would also like to do proximity searches, with these broken into 
>> separate terms.  So, I may be searching for "arterial", "oxygen", and 
>> "saturate" all within 10 terms of each other, and get the positions 
>> of the cases that match.
>>
>> To the best of my understanding, Lucene is not a good choice to solve 
>> this problem (please correct me if I'm wrong).   As far as I can 
>> tell, Lucene breaks up a document into a set of terms, and indexes 
>> these in some sort of structure.  My guess is a B+ tree, but I'm 
>> curious to learn more about it -- I couldn't find much in the 
>> documentation about the underlying index structure.   Anyway, this 
>> means that the keys->pointer pairs in the index are basically 
>> term->documenID pairs.  So this isn't very suitable for my problem. I 
>> already know which document I want to search, I'm interested in the 
>> position of hits.    If I were to search for the phrase "arterial 
>> oxygen saturation", this would be broken into terms and I could 
>> iterate through all of the TermPositions for a given term in the 
>> document, and try to find out where these terms are adjacent in the 
>> document.  Considering that my document is very large, the phrases 
>> can be 10+ terms, and I need to do this hundreds of times, this 
>> doesn't sound like a very good solution.  If we introduce the idea of 
>> fuzzy matches and proximity searches, it seems like this task of 
>> iterating through TermPositions becomes very complicated.
>> I've spent time reading the docs, creating a test program, and 
>> reading the mailing list.  As far as I can tell, Lucene is geared 
>> towards document based queries, and isn't the ideal tool for my 
>> problem.  I think an index based on a suffix tree (or variation of) 
>> would better meet my needs, but I'm not sure how well these perform 
>> with fuzzy and proximity searches.  I've looked around, and I can't 
>> seem to find a good opensource indexing framework like lucene that's 
>> based on a suffix tree.  Are there any suggestions for tools that 
>> would help with this problem?  Does anyone have any suggestions on 
>> how I might bend Lucene to meet my needs?
>>
>> Thanks in advance,
>>
>> JLuna
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Advice regarding fuzzy phrase searching

Posted by Mark Miller <ma...@gmail.com>.

Take a look at: https://issues.apache.org/jira/browse/LUCENE-794

This is an extension to the Highlighter that highlights span and 
proximity queries. If you rewrite the query it will also do fuzzy 
queries. I am sure you can easily steal some of the code to do what you 
want.

Keep in mind, because of how Lucene's SpanQuery works, if you say to 
find 'mark within 4 of ball', Lucene will not find all occurrences. ie: 
'mark close to ball ball' -- even if you say find mark within 20 of 
ball, a Span query will only find the first occurrence of ball even 
though both occurrences are within 20. If ball was on both sides of 
mark, both would match, but after finding the first ball with 20 of 
mark, Span doesnt continue looking for another.

- Mark

Jose Luna wrote:
> Hello,
>
> I am looking for some advice regarding which tools I might use to 
> solve my problem.  I apologize ahead of time for the long explanation.
>
> Problem Description:  I would like to index a set of very large HTML 
> documents.  I would then be able to run two different kinds of 
> queries: proximity queries, and fuzzy phrase queries.   I would like 
> to get the exact positions of the matching results from the query (I 
> need to modify the original documents at these positions.)  I will 
> only need to search one document at a time, i.e., I already know which 
> document I'll be looking in, so what's important is finding the 
> positions of the hits within that document.
>
> For example,  for a fuzzy search, I may want to search for "arterial 
> oxygen saturation".   I would want this to match "arterial oxygen 
> saturate", and I would want to get the position of where it matches.  
> I would also like to do proximity searches, with these broken into 
> separate terms.  So, I may be searching for "arterial", "oxygen", and 
> "saturate" all within 10 terms of each other, and get the positions of 
> the cases that match.
>
> To the best of my understanding, Lucene is not a good choice to solve 
> this problem (please correct me if I'm wrong).   As far as I can tell, 
> Lucene breaks up a document into a set of terms, and indexes these in 
> some sort of structure.  My guess is a B+ tree, but I'm curious to 
> learn more about it -- I couldn't find much in the documentation about 
> the underlying index structure.   Anyway, this means that the 
> keys->pointer pairs in the index are basically term->documenID pairs.  
> So this isn't very suitable for my problem. I already know which 
> document I want to search, I'm interested in the position of hits.    
> If I were to search for the phrase "arterial oxygen saturation", this 
> would be broken into terms and I could iterate through all of the 
> TermPositions for a given term in the document, and try to find out 
> where these terms are adjacent in the document.  Considering that my 
> document is very large, the phrases can be 10+ terms, and I need to do 
> this hundreds of times, this doesn't sound like a very good solution.  
> If we introduce the idea of fuzzy matches and proximity searches, it 
> seems like this task of iterating through TermPositions becomes very 
> complicated.
> I've spent time reading the docs, creating a test program, and reading 
> the mailing list.  As far as I can tell, Lucene is geared towards 
> document based queries, and isn't the ideal tool for my problem.  I 
> think an index based on a suffix tree (or variation of) would better 
> meet my needs, but I'm not sure how well these perform with fuzzy and 
> proximity searches.  I've looked around, and I can't seem to find a 
> good opensource indexing framework like lucene that's based on a 
> suffix tree.  Are there any suggestions for tools that would help with 
> this problem?  Does anyone have any suggestions on how I might bend 
> Lucene to meet my needs?
>
> Thanks in advance,
>
> JLuna
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org