
Posted to java-user@lucene.apache.org by Christopher M Collins <co...@us.ibm.com> on 2008/08/01 04:06:55 UTC

SpanRegexQuery

Hello,

I'm trying to use SpanRegexQuery as one of the clauses in my SpanQuery.
When I give it a regex like: "L[a-z]+ing" and do a rewrite on the final
query I get terms like "Labinger" and "Lackonsingh" along with the expected
terms "Labeling", "Lacing", etc.  It's as if the regex is treated as a
"find()" and not a "match()" in Java.  Is there a way to make it behave
like a full match, and not a prefix regex?

Thanks!

Christopher

______________________________________________________________
Christopher Collins \ http://www.cs.utoronto.ca/~ccollins
Department of Computer Science \ University of Toronto
Collaborative User Experience Group \ IBM Research

Re: SpanRegexQuery

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jul 31, 2008, at 10:06 PM, Christopher M Collins wrote:
> I'm trying to use SpanRegexQuery as one of the clauses in my SpanQuery.
> When I give it a regex like: "L[a-z]+ing" and do a rewrite on the final
> query I get terms like "Labinger" and "Lackonsingh" along with the expected
> terms "Labeling", "Lacing", etc.  It's as if the regex is treated as a
> "find()" and not a "match()" in Java.  Is there a way to make it behave
> like a full match, and not a prefix regex?

There are two implementations of the regex engine built into  
SpanRegexQuery, one using Java's java.util.regex, the other using  
Jakarta Regexp.  The default implementation is java.util.regex, which  
matches like this:

   pattern.matcher(string).lookingAt()

And Jakarta Regexp matches like this:

   regexp.match(string)

I'm not sure of the differences between these two without doing some
tests, but certainly they should, ahem, match at least in the
expectation of whether there is an implied ^string$ or not.  At a
quick glance at the respective javadocs, it does seem like the
java.util.regex implementation should be using
pattern.matcher(string).matches() instead.  lookingAt() always starts
at the beginning but does not have to consume the whole string, so
there is an implied ^string effect (and no implied $), but not so
with the Jakarta Regexp implementation.
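
To make the difference concrete, here is a minimal standalone sketch that
uses only java.util.regex (no Lucene classes) to compare lookingAt(),
matches() and find() on the terms from the original question.  The class
and method names are made up for illustration, and "xLacing" is an invented
term added only to show the find() case.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
public class MatchModes {
    // Anchored at the start only: the pattern may stop before the end of the term.
    public static boolean lookingAt(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    // Anchored at both ends: the pattern must consume the whole term.
    public static boolean matches(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    // Not anchored: the pattern may match anywhere inside the term.
    public static boolean find(String regex, String term) {
        return Pattern.compile(regex).matcher(term).find();
    }

    public static void main(String[] args) {
        String regex = "L[a-z]+ing";
        String[] terms = {"Labeling", "Lacing", "Labinger", "Lackonsingh", "xLacing"};
        for (String term : terms) {
            System.out.println(term
                + " lookingAt=" + lookingAt(regex, term)
                + " matches=" + matches(regex, term)
                + " find=" + find(regex, term));
        }
    }
}
```

With lookingAt(), "Labinger" is accepted because the prefix "Labing"
satisfies the pattern; matches() rejects it because "er" is left over.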

As Daniel mentioned, putting a $ at the end should do the trick, and
it seems to me that it really should be necessary... but so should ^ in
front if you want it to start at the beginning and not match anywhere
in the string.
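
A rough sketch of that workaround, again using plain java.util.regex
rather than Lucene itself.  Here lookingAt() stands in for the prefix-style
check quoted above, and find() is only an assumed stand-in for an engine
that does not anchor at the start; it does not exercise Jakarta Regexp.
The class name, method names and the term "McLabeling" are invented for
the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
public class AnchorWorkaround {
    // Prefix-style check: anchored at the start of the term only.
    public static boolean prefixStyle(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    // Unanchored check: the pattern may match anywhere in the term.
    public static boolean anywhereStyle(String regex, String term) {
        return Pattern.compile(regex).matcher(term).find();
    }

    public static void main(String[] args) {
        // A trailing $ is enough when the start is already anchored.
        System.out.println(prefixStyle("L[a-z]+ing$", "Labeling"));     // true
        System.out.println(prefixStyle("L[a-z]+ing$", "Labinger"));     // false
        System.out.println(prefixStyle("L[a-z]+ing$", "Lackonsingh"));  // false

        // Without start anchoring, a leading ^ is needed as well.
        System.out.println(anywhereStyle("L[a-z]+ing$", "McLabeling"));   // true
        System.out.println(anywhereStyle("^L[a-z]+ing$", "McLabeling"));  // false
    }
}
```

Writing the pattern as ^L[a-z]+ing$ gives the same result under either
style of check, so it does not depend on which engine is in use.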

Changing JavaUtilRegexCapabilities to use matches() seems like the  
right thing to do, but that'd break backwards compatibility.  *ugh*

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SpanRegexQuery

Posted by Daniel Noll <da...@nuix.com>.

Christopher M Collins wrote:
> Hello,
> 
> I'm trying to use SpanRegexQuery as one of the clauses in my SpanQuery.
> When I give it a regex like: "L[a-z]+ing" and do a rewrite on the final
> query I get terms like "Labinger" and "Lackonsingh" along with the expected
> terms "Labeling", "Lacing", etc.  It's as if the regex is treated as a
> "find()" and not a "match()" in Java.  Is there a way to make it behave
> like a full match, and not a prefix regex?

Have you tried appending $ onto the end of it?  I think we noticed the 
same issue with regex queries here and had to apply a workaround of that 
sort.

Daniel


-- 
Daniel Noll
