Posted to dev@lucene.apache.org by "Trejkaz (JIRA)" <ji...@apache.org> on 2009/06/11 03:43:07 UTC

[jira] Created: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

RegexQuery matches terms the input regex doesn't actually match
---------------------------------------------------------------

                 Key: LUCENE-1683
                 URL: https://issues.apache.org/jira/browse/LUCENE-1683
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
    Affects Versions: 2.3.2
            Reporter: Trejkaz


I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting.

The regex "cat." will match "cats" but also any term that starts with "cat" followed by one or more letters (e.g. "cathy", "catcher", ...). It is as if an implicit .* were always appended to the regex.

Here's a unit test for the behaviour I would expect myself:

    @Test
    public void testNecessity() throws Exception {
        File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        try {
            Document doc = new Document();
            doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        } finally {
            writer.close();
        }

        IndexReader reader = IndexReader.open(dir);
        try {
            TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
            assertEquals("Wrong term", "cats", terms.term().text());
            assertFalse("Should have only been one term", terms.next());
        } finally {
            reader.close();
        }
    }

This test fails on the term check with terms.term() equal to "cathy".

Our workaround is to mangle the query like this:

    String fixed = String.format("(?:%s)$", original);
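
A minimal sketch of what this workaround does, using only java.util.regex and assuming that terms are tested with a prefix-style match (Matcher.lookingAt() behaves this way). The class and method names (AnchoredRegexDemo, anchor, prefixMatch) are made up for illustration and are not part of Lucene.

```java
import java.util.regex.Pattern;

class AnchoredRegexDemo {
    // Hypothetical helper: the same expression as the workaround above.
    static String anchor(String original) {
        return String.format("(?:%s)$", original);
    }

    // Prefix-style match: the pattern must match at the start of the term,
    // but it does not have to consume the whole term.
    static boolean prefixMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    public static void main(String[] args) {
        String original = "cat.";
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.println(term
                    + " plain=" + prefixMatch(original, term)
                    + " anchored=" + prefixMatch(anchor(original), term));
        }
    }
}
```

The non-capturing group keeps the anchor attached to the whole expression: with an alternation such as "cat|dog", appending "$" directly would anchor only the "dog" branch. In java.util.regex, "$" also matches just before a final line terminator; "\z" is the stricter end-of-input anchor, although that difference should not matter for single-token terms like these.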


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Trejkaz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718270#action_12718270 ] 

Trejkaz commented on LUCENE-1683:
---------------------------------

I screwed up the formatting.  Fixed version:

{code}
    @Test
    public void testNecessity() throws Exception
    {
        File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);

        TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
        assertEquals("Wrong term", "cats", terms.term().text());
        assertFalse("Should have only been one term", terms.next());
    }
{code}



[jira] Issue Comment Edited: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060 ] 

Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM:
---------------------------------------------------------------

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match to use Matcher.matches() instead of lookingAt().

      was (Author: steve_rowe):
    bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.matches() instead of lookingAt().
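
The difference between lookingAt() and matches() described in this comment can be seen with plain java.util.regex, outside Lucene. The following is only an illustrative sketch; the class name LookingAtVsMatches and the helper names prefix and whole are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtVsMatches {
    // True if the pattern matches a prefix of the input
    // (anchored at the start, free at the end).
    static boolean prefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt();
    }

    // True only if the pattern matches the entire input.
    static boolean whole(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"cat", "cats", "cathy", "bobcats"}) {
            System.out.printf("%-8s lookingAt=%-5b matches=%b%n",
                    term, prefix("cat.", term), whole("cat.", term));
        }
    }
}
```

Both methods reject "bobcats", because lookingAt() is still anchored at the start of the input; the two only disagree on terms such as "cathy", where the pattern matches a proper prefix.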
  

[jira] Resolved: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1683.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.9

Thanks Trejkaz!


[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060 ] 

Steven Rowe commented on LUCENE-1683:
-------------------------------------

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead? 

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match to use j.u.Matcher.matches() instead of lookingAt().


[jira] Assigned: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1683:
------------------------------------------

    Assignee: Michael McCandless


[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732050#action_12732050 ] 

Michael McCandless commented on LUCENE-1683:
--------------------------------------------

Do you have a proposed fix for this...?  Or, why is RegexQuery treating the trailing "." as a ".*" instead?


[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737620#action_12737620 ] 

Michael McCandless commented on LUCENE-1683:
--------------------------------------------

I agree this is a bug -- I'll switch to matches shortly.
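
As a rough illustration of the effect of switching to matches(), here is a standalone sketch. It is not the actual Lucene patch: WholeTermRegex is a hypothetical class that only loosely follows the compile()/match() contract quoted earlier in this thread, and accepted() is an invented helper that visits terms in sorted order, as in the original test where "cathy" was seen before "cats".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

class WholeTermRegex {
    private Pattern pattern;

    // Remember the pattern for later match() calls.
    void compile(String regex) {
        pattern = Pattern.compile(regex);
    }

    // True only if the whole term matches the pattern last passed to compile().
    boolean match(String term) {
        return pattern.matcher(term).matches();
    }

    // Hypothetical helper: return the accepted terms in sorted order.
    static List<String> accepted(String regex, String... terms) {
        WholeTermRegex re = new WholeTermRegex();
        re.compile(regex);
        List<String> out = new ArrayList<String>();
        for (String term : new TreeSet<String>(Arrays.asList(terms))) {
            if (re.match(term)) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same three terms as the indexed field in the reported test.
        System.out.println(accepted("cat.", "cat", "cats", "cathy"));
    }
}
```

With whole-term matching, "cats" is the only term accepted for "cat.", which is what the reported unit test expects.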
