Posted to dev@lucene.apache.org by "Kuri Masta (JIRA)" <ji...@apache.org> on 2010/08/27 14:37:53 UTC

[jira] Created: (SOLR-2093) regular expression in PatternReplaceFilter can handle: /([^/]*)

regular expression in PatternReplaceFilter can handle: /([^/]*)
---------------------------------------------------------------

                 Key: SOLR-2093
                 URL: https://issues.apache.org/jira/browse/SOLR-2093
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis
    Affects Versions: 1.4
         Environment: debian,JRE1.6,solr1.4
            Reporter: Kuri Masta
            Priority: Minor


Using PatternReplaceFilter, I want to extract a certain word from a URI.
Although I now understand that I should handle this outside of Solr, the fact remains that Solr does not handle regular expressions adequately.

Viewing the source code, I don't see any problems, since it uses the Java library.

The problem:
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                        pattern="/([^/]*)/[^/]*$" replacement="$1"  replace="all" />
      </analyzer>

Input text:
- a/b/c

Expected:
- b

Result in Solr:
- ab

An online Java regexp tester (http://www.regexplanet.com/simple/index.html):
- b

So the problem area lies at /([^/])

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2093) regular expression in PatternReplaceFilter can handle: /([^/]*)

Posted by "Kuri Masta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904140#action_12904140 ] 

Kuri Masta commented on SOLR-2093:
----------------------------------

With a/b/c as input:

You'll notice that I start searching from the end of the line.
1. (a$)       match everything to the left until: /
2. (/)        match /
3. ($1 = b)   repeat the previous step, but capture the match
4. (/)        match /

I wouldn't even know how to write regexp so it will concatenate two separate matches, divided by '/', into one var.

Before posting, I tried two regexp tools besides Solr.

I would like you to try again. But please keep in mind that I don't need this fix, I just found a bug and am reporting it.

> regular expression in PatternReplaceFilter can handle: /([^/]*)
> ---------------------------------------------------------------
>
>                 Key: SOLR-2093
>                 URL: https://issues.apache.org/jira/browse/SOLR-2093
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4
>         Environment: debian,JRE1.6,solr1.4
>            Reporter: Kuri Masta
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Using PatternReplaceFilter i want to extract a certain word out of the URI.
> Although I now understand that I should handle this outside of Solr, the fact remains that Solr does not adequately handle regular expressions.
> Viewing the source code, I don't see any problems since it uses the java library.
> The problem:
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>                         pattern="/([^/]*)/[^/]*$" replacement="$1"  replace="all" />
>       </analyzer>
> Input text:
> - a/b/c
> Expected
> - b
> Result Solr
> - ab
> An online JAVA regexp tester (http://www.regexplanet.com/simple/index.html):
> - b
> So the problem area lies at /([^/])



[jira] Commented: (SOLR-2093) regular expression in PatternReplaceFilter can handle: /([^/]*)

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916307#action_12916307 ] 

Hoss Man commented on SOLR-2093:
--------------------------------

bq. I would like you to try again. But please keep in mind that I don't need this fix, I just found a bug and am reporting it.

I see no bug here.

As Koji described, even the online regex tool you provided shows these exact results.

Input...
{noformat}
Regular Expression: /([^/]*)/[^/]*$
Replacement: $1
Test String #1: a/b/c
{noformat}

Output...
{noformat}
...
replaceAll(): ab
...
group(0): /b/c
group(1): b
{noformat}

bq. I wouldn't even know how to write regexp so it will concatenate two separate matches

I don't think you understand the regex you provided.  I don't believe there are two matches; I believe there is one match (referred to in your online tool as "group(0)"), and that entire match is replaced by the first parenthetical group (referred to in your online tool as "group(1)").
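
To make the group(0) versus group(1) distinction concrete, here is a minimal standalone sketch using plain java.util.regex, with the pattern and input from the report. The class and method names are made up for illustration, and this is not the Solr filter code itself; it only assumes, as the report says, that the filter relies on the Java regex library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class SingleMatchDemo {
    // Pattern copied from the issue report.
    static final Pattern PATTERN = Pattern.compile("/([^/]*)/[^/]*$");

    /** Returns {group(0), group(1)} of the first match, or null if nothing matches. */
    static String[] firstMatchGroups(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    /** Replaces every match (the whole of group(0)) with the captured group. */
    static String replaceAll(String input) {
        return PATTERN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("a/b/c");
        System.out.println("group(0): " + groups[0]);               // /b/c
        System.out.println("group(1): " + groups[1]);               // b
        System.out.println("replaceAll(): " + replaceAll("a/b/c")); // ab
    }
}
```

The leading "a" is never part of the match, so it survives the replacement; only "/b/c" is swapped for "b".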






[jira] Commented: (SOLR-2093) regular expression in PatternReplaceFilter can handle: /([^/]*)

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916309#action_12916309 ] 

Hoss Man commented on SOLR-2093:
--------------------------------

Note: Part of your confusion may lie in the meaning of {{replace="all"}}. This doesn't mean "replace the entire Token"; it means "replace all matches of the regex with the replacement value". So the pattern is evaluated over and over against the input string (each time starting at the end of the previous match) until it no longer matches, and each match results in a replacement.
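
As a rough illustration of "all matches" versus only one match, the sketch below uses the plain java.util.regex calls Matcher.replaceAll and Matcher.replaceFirst. The pattern "/" and the "-" replacement are hypothetical, chosen only because they match more than once in a/b/c; the class name is likewise made up, and none of this is the filter's own code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class ReplaceAllVsFirstDemo {
    // Illustrative pattern that matches twice in "a/b/c".
    static final Pattern SLASH = Pattern.compile("/");

    /** Every match is replaced; text outside the matches is left alone. */
    static String replaceEveryMatch(String input) {
        return SLASH.matcher(input).replaceAll("-");
    }

    /** Only the first match is replaced. */
    static String replaceFirstMatchOnly(String input) {
        return SLASH.matcher(input).replaceFirst("-");
    }

    public static void main(String[] args) {
        System.out.println(replaceEveryMatch("a/b/c"));     // a-b-c
        System.out.println(replaceFirstMatchOnly("a/b/c")); // a-b/c
    }
}
```

In both cases the characters that are not part of any match stay in the output, which is why an unanchored pattern cannot discard the rest of the token.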

If you want the entire input Token to be replaced by the parenthetical group, you need to anchor your regex at both ends.  This should work:

{noformat}
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^.*/([^/]*)/[^/]*$" replacement="$1" replace="all" />
{noformat}
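
A quick way to check the anchored pattern outside Solr is a small java.util.regex sketch like the one below. The class and method names are hypothetical, and it assumes each token is a single line, since "." does not match line terminators by default.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class AnchoredPatternDemo {
    // Anchored pattern from the suggestion above: the match covers the whole token.
    static final Pattern ANCHORED = Pattern.compile("^.*/([^/]*)/[^/]*$");

    /**
     * Returns the second-to-last '/'-separated segment of the token.
     * Tokens with fewer than two slashes do not match and come back unchanged.
     */
    static String secondToLastSegment(String token) {
        return ANCHORED.matcher(token).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(secondToLastSegment("a/b/c"));     // b
        System.out.println(secondToLastSegment("x/y/a/b/c")); // b
        System.out.println(secondToLastSegment("abc"));       // abc
    }
}
```

Because the greedy "^.*" absorbs everything before the last two slashes, group(0) is now the entire token, so replacing it with $1 leaves only the captured segment.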



[jira] Resolved: (SOLR-2093) regular expression in PatternReplaceFilter can handle: /([^/]*)

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved SOLR-2093.
----------------------------

    Resolution: Not A Problem



[jira] Commented: (SOLR-2093) regular expression in PatternReplaceFilter can handle: /([^/]*)

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904086#action_12904086 ] 

Koji Sekiguchi commented on SOLR-2093:
--------------------------------------

{quote}
An online JAVA regexp tester (http://www.regexplanet.com/simple/index.html):
* b
{quote}

I tried the Java regex tester, but its result was the same as Solr's, i.e. "ab". Please look at replaceAll(), not group(1).
