Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2009/07/16 20:14:14 UTC
[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms
the input regex doesn't actually match
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060 ]
Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM:
---------------------------------------------------------------
bq. ... why is RegexQuery treating the trailing "." as a ".*" instead?
JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.
By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".
The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".
The fix is to switch JavaUtilRegexCapabilities.match() to use Matcher.matches() instead of lookingAt().
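
To illustrate the difference described above, here is a minimal standalone sketch using only java.util.regex. The class and helper names (LookingAtVsMatches, prefixMatch, fullMatch) are made up for this example and are not part of Lucene; the sketch only shows how lookingAt() and matches() differ, not how JavaUtilRegexCapabilities is actually written.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene.
class LookingAtVsMatches {

    // lookingAt() anchors the match at the start of the input only,
    // so the pattern may match just a prefix of the term.
    static boolean prefixMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    // matches() requires the pattern to consume the entire input.
    static boolean fullMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    public static void main(String[] args) {
        // The same terms as in the unit test from the issue description.
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.printf("%-5s lookingAt=%b matches=%b%n",
                    term, prefixMatch("cat.", term), fullMatch("cat.", term));
        }
    }
}
```

With the regex "cat.", both methods reject "cat" and accept "cats", but only lookingAt() accepts "cathy", which is the behaviour reported in this issue.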
was (Author: steve_rowe):
bq. ... why is RegexQuery treating the trailing "." as a ".*" instead?
JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.
By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".
The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".
The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.matches() instead of lookingAt().
> RegexQuery matches terms the input regex doesn't actually match
> ---------------------------------------------------------------
>
> Key: LUCENE-1683
> URL: https://issues.apache.org/jira/browse/LUCENE-1683
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.3.2
> Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting.
> The regex "cat." should match only "cats", but it also matches any term that starts with "cat" followed by one or more characters (e.g. "cathy", "catcher", ...). It is as if there is an implicit .* always appended to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
>   @Test
>   public void testNecessity() throws Exception {
>     File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
>     IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
>     try {
>       Document doc = new Document();
>       doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
>       writer.addDocument(doc);
>     } finally {
>       writer.close();
>     }
>     IndexReader reader = IndexReader.open(dir);
>     try {
>       TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
>       assertEquals("Wrong term", "cats", terms.term());
>       assertFalse("Should have only been one term", terms.next());
>     } finally {
>       reader.close();
>     }
>   }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
> String fixed = String.format("(?:%s)$", original);
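
As a rough check of the workaround quoted above, the following sketch applies the same String.format wrapping and evaluates it with lookingAt(), which is how the comment describes JavaUtilRegexCapabilities.match(). The class and method names (AnchoredWorkaround, anchor, prefixMatch) are hypothetical and exist only for this example. The non-capturing group appears to matter when the original regex contains a top-level alternation, since a bare trailing "$" would bind only to the last alternative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene.
class AnchoredWorkaround {

    // Same wrapping as the workaround in the issue: group the original
    // regex and require the end of input after it.
    static String anchor(String original) {
        return String.format("(?:%s)$", original);
    }

    // Mimics the lookingAt()-based matching described in the comment.
    static boolean prefixMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    public static void main(String[] args) {
        // With the wrapper, "cat." no longer matches the longer term.
        System.out.println(prefixMatch(anchor("cat."), "cats"));
        System.out.println(prefixMatch(anchor("cat."), "cathy"));

        // Without the group, "$" applies only to the "dog" alternative,
        // so "cat" can still match as a prefix of "cathy".
        System.out.println(prefixMatch("cat|dog$", "cathy"));
        System.out.println(prefixMatch(anchor("cat|dog"), "cathy"));
    }
}
```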