Posted to solr-user@lucene.apache.org by Timothy Potter <th...@gmail.com> on 2012/06/24 02:11:40 UTC

Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Using 3.5 (also tried trunk), I have the following charFilter defined
on my fieldType (just extended text_general to keep things simple):

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\w)\1{2,}+"
            replaceWith="$1$1"/>

The intent of this charFilter is to match any character that is
repeated more than twice in a row and collapse the run down to a
maximum of two, e.g.

fooobarrrr  =>  foobarr

Using the analysis form, I end up with: fba

Here is the full <fieldType> definition (just the one addition of the
leading <charFilter>):

    <fieldType name="text_general" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\w)\1{2,}+"
                    replaceWith="$1$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

It seems like my regex and replacement strategy should work. To
check, I wrote a little Regex.java class that borrows some code from
the PatternReplaceCharFilter class. When I run the following with my
little hack, I get the expected result:

[~/dev]$ java Regex "(\\w)\\1{2,}+" fooobarrrr "\$1\$1"
result: foobarr

Is this a known issue, or does anyone know how to work around it? If
not, I'll open a JIRA, but I wanted to check here first.

Cheers,
Tim

>>>> Regex.java  <<<<

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {
    // Usage: java Regex <pattern> <input> <replacement>
    public static void main(String[] args) throws Exception {
        String toCompile = args[0];
        Pattern p = Pattern.compile(toCompile);
        System.out.println("result: " + processPattern(p, args[1], args[2]));
    }

    // borrowed from PatternReplaceCharFilter.java
    private static CharSequence processPattern(Pattern pattern,
            CharSequence input, String replacement) {
        final Matcher m = pattern.matcher(input);

        final StringBuffer cumulativeOutput = new StringBuffer();
        int cumulative = 0;
        int lastMatchEnd = 0;
        while (m.find()) {
            final int groupSize = m.end() - m.start();
            final int skippedSize = m.start() - lastMatchEnd;
            lastMatchEnd = m.end();
            final int lengthBeforeReplacement =
                    cumulativeOutput.length() + skippedSize;
            m.appendReplacement(cumulativeOutput, replacement);
            final int replacementSize =
                    cumulativeOutput.length() - lengthBeforeReplacement;
            if (groupSize != replacementSize) {
                if (replacementSize < groupSize) {
                    cumulative += groupSize - replacementSize;
                    int atIndex = lengthBeforeReplacement + replacementSize;
                    // Offset correction is disabled in this standalone copy.
                    //System.err.println(atIndex + "!" + cumulative);
                    //addOffCorrectMap(atIndex, cumulative);
                }
            }
        }
        m.appendTail(cumulativeOutput);
        return cumulativeOutput;
    }
}

Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Posted by Jack Krupansky <ja...@basetechnology.com>.

Yeah, it was kind of unfortunate that the posted example in SOLR-1653 used 
"replaceWith" but the committed code used "replacement". The detailed 
commentary on the issue notes the change, but the change occurred between 
the last posted patch and the commit. The source code and javadoc "rule", 
but we tend to assume that the Jira is more accurate than it necessarily is.

-- Jack Krupansky


Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Posted by Timothy Potter <th...@gmail.com>.

Awesome find Jack - thanks! Copied the "replaceWith" bit from
http://lucidworks.lucidimagination.com/display/solr/CharFilterFactories

Cheers,
Tim


Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Posted by Jack Krupansky <ja...@basetechnology.com>.

The char filter's attribute name is "replacement", not "replaceWith". I 
tried it and it seems to work fine (with Solr 3.6).

<charFilter class="solr.PatternReplaceCharFilterFactory"
   pattern="(\w)\1{2,}+"
   replacement="$1$1"/>

See:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html

-- Jack Krupansky
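
The "fba" result in the original report is consistent with the
replacement being an empty string: every run of three or more repeated
characters is deleted instead of collapsed. The sketch below reproduces
that with plain java.util.regex. It assumes that an unrecognized
attribute such as "replaceWith" is ignored and that a missing
"replacement" falls back to ""; that fallback is a guess based on the
observed output, not something confirmed in this thread. The class and
method names are hypothetical and are not part of Solr.

```java
import java.util.HashMap;
import java.util.Map;

public class MissingReplacementDemo {
    // Builds a config map that stores the replacement under the given
    // attribute name ("replacement" or the mistaken "replaceWith").
    public static Map<String, String> config(String attributeName) {
        Map<String, String> args = new HashMap<>();
        args.put("pattern", "(\\w)\\1{2,}+");
        args.put(attributeName, "$1$1");
        return args;
    }

    // Hypothetical stand-in for a factory reading its attributes.
    // ASSUMPTION: a missing "replacement" defaults to the empty string.
    public static String apply(Map<String, String> args, String input) {
        String pattern = args.get("pattern");
        String replacement = args.getOrDefault("replacement", "");
        return input.replaceAll(pattern, replacement);
    }

    public static void main(String[] unused) {
        // Wrong attribute name: the runs "ooo" and "rrrr" are removed.
        System.out.println(apply(config("replaceWith"), "fooobarrrr")); // fba
        // Correct attribute name: each run is collapsed to two characters.
        System.out.println(apply(config("replacement"), "fooobarrrr")); // foobarr
    }
}
```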


Re: Having an issue with the solr.PatternReplaceCharFilterFactory not replacing characters correctly

Posted by Lance Norskog <go...@gmail.com>.

Please 1) make sure with a separate program that these are the right
Java regex patterns, and 2) write a unit test with all of the cases
you expect this to handle. Then file a JIRA with the unit test code.
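
As a rough illustration of the kind of case list suggested above, here
is a minimal table-driven check. It exercises only the regex and the
"$1$1" replacement through java.util.regex, not Solr's char filter or
its offset correction, so it is a starting point rather than the unit
test one would attach to a JIRA. The class name and the chosen cases
are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class CollapseRepeatsCheck {
    // Same pattern as in the charFilter: a word character followed by
    // two or more copies of itself (a run of three or more).
    private static final Pattern REPEATS = Pattern.compile("(\\w)\\1{2,}+");

    public static String collapse(String input) {
        return REPEATS.matcher(input).replaceAll("$1$1");
    }

    public static void main(String[] args) {
        // input -> expected output
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("fooobarrrr", "foobarr"); // example from the report
        cases.put("foo", "foo");            // a run of two is left alone
        cases.put("aaa", "aa");             // whole input is one run
        cases.put("aabbb", "aabb");         // only the long run changes
        cases.put("1111", "11");            // \w also matches digits
        cases.put("!!!", "!!!");            // \w does not match punctuation
        cases.put("", "");                  // empty input

        for (Map.Entry<String, String> c : cases.entrySet()) {
            String actual = collapse(c.getKey());
            if (!actual.equals(c.getValue())) {
                throw new AssertionError("input '" + c.getKey()
                        + "': expected '" + c.getValue()
                        + "' but got '" + actual + "'");
            }
        }
        System.out.println("all " + cases.size() + " cases passed");
    }
}
```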




-- 
Lance Norskog
goksron@gmail.com