Posted to dev@lucene.apache.org by "Edwin Yeo Zheng Lin (JIRA)" <ji...@apache.org> on 2019/02/13 08:50:00 UTC

[jira] [Updated] (SOLR-13242) RegexReplaceProcessorFactory not making accurate replacement

     [ https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edwin Yeo Zheng Lin updated SOLR-13242:
---------------------------------------
    Description: 
We are using the RegexReplaceProcessorFactory with the following configuration:

 

 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\s*\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
 </processor>

 

The regex patterns (\s*\n){2,} and (\n\s*){2,} both work as expected on [regex101.com|http://regex101.com/]: every run of consecutive \n is replaced by exactly two <br>.

However, in Solr there are cases (Examples 2 and 3 below) where four <br> appear in a row. This should not happen, because the pattern is meant to replace each run with two <br> regardless of how many \n it contains.
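
For reference, the expected behaviour can be sketched in plain Java. This is only a minimal sketch, assuming the processor applies the pattern with java.util.regex and replaces every match with the literal replacement string. The class BlankRunCollapse and the method collapse are hypothetical names used for illustration and are not part of Solr. The input is the "Original content" string from Example 1 of this report, with ordinary ASCII spaces.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Solr. */
final class BlankRunCollapse {
    // Same pattern as the "pattern" value in the configuration,
    // written as a Java string literal (backslashes doubled).
    private static final Pattern BLANK_RUN = Pattern.compile("(\\s*\\n){2,}");

    /** Replaces every run matched by the pattern with a literal "<br><br>". */
    static String collapse(String content) {
        return BLANK_RUN.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // "Original content" from Example 1, assuming plain ASCII whitespace.
        String example1 = "Dear Sir,  \n\n \n \n\n I am terminating";
        // The whole run of blank lines is one match, so one "<br><br>" appears.
        System.out.println(collapse(example1));
    }
}
```

With ASCII whitespace only, each run of blank lines is a single match, so each run yields exactly one <br><br> pair. A lone \n is left untouched because the group must repeat at least twice.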

 

 

Example 1: Content for which the above regex pattern works correctly

*Original content in EML file:*

Dear Sir, 

 

I am terminating 

*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating

*Index content:*     Dear Sir,  <br><br>I am terminating 

 

Example 2: Content for which the above regex pattern only partially works (there are 4 <br> instead of 2)

*Original content in EML file:*

_exalted_

_Psalm 89:17_

 

3 Choa Chu Kang Avenue 4    

*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore

*Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore

 

Example 3: Content for which the above regex pattern only partially works (there are 4 <br> instead of 2)

*Original content in EML file:*

[http://www.concordpri.moe.edu.sg/]

 

 

 

 

On Tue, Dec 18, 2018 at 10:07 AM    

*Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM 

*Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On Tue, Dec 18, 2018 at 10:07 AM
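
The report does not say which whitespace characters the EML text actually contains. One unconfirmed possibility is that the run is interrupted by a character that Java's default \s does not match, such as a no-break space (U+00A0). The sketch below only illustrates how such a character would split one run into two matches. The class NbspProbe, its fields, and the test strings are hypothetical; the second string is Example 2 with one space replaced by U+00A0 as an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical probe for illustration only; not part of Solr. */
final class NbspProbe {
    // Default mode: \s matches only ASCII whitespace (space, \t, \n, \x0B, \f, \r).
    static final Pattern ASCII_WS = Pattern.compile("(\\s*\\n){2,}");
    // (?U) enables UNICODE_CHARACTER_CLASS, so \s also matches U+00A0.
    static final Pattern UNICODE_WS = Pattern.compile("(?U)(\\s*\\n){2,}");

    /** Counts how many separate matches the pattern finds in the text. */
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Tail of Example 2 as written in the report (ASCII spaces only).
        String ascii = "Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
        // Same text with one space replaced by a no-break space (assumption).
        String nbsp = "Psalm 89:17   \n\n \u00A0 \n\n  3 Choa Chu Kang Avenue 4";

        System.out.println(countMatches(ASCII_WS, ascii));   // one run, one match
        System.out.println(countMatches(ASCII_WS, nbsp));    // run split in two
        System.out.println(countMatches(UNICODE_WS, nbsp));  // one match again
    }
}
```

If this hypothesis applied, and if the processor compiles the pattern with java.util.regex as assumed here, prefixing the configured pattern with (?U) would be one thing to try. Inspecting the code points of the indexed field would confirm or rule out the hypothesis.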

  was:
We are using the RegexReplaceProcessorFactory with the following configuration

 

 <processor class="solr.RegexReplaceProcessorFactory">

   <str name="fieldName">content</str>

   <str name="pattern">(\s*\n)\{2,}</str>

   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>

 </processor>

 

The regex pattern of (\s*\n)\{2,} is working perfectly in [regex101.com|http://regex101.com/], in which all the \n will be replaced by only two <br>

However, in Solr, there are cases (in Example 2 and 3 below) that has four <br> in a row. This should not be the case, as we have already set it to replace by two <br> regardless of how many \n are there in a row.

 

 

Example 1: The sentence that the above regex pattern is working correctly 

*Original content in EML file:*  

Dear Sir, 

 

I am terminating 

*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating

*Index content:*     Dear Sir,  <br><br>I am terminating 

 

Example 2: The sentence that the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)

*Original content in EML file:*    

_exalted_

_Psalm 89:17_

 

3 Choa Chu Kang Avenue 4    

*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore

*Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore

 

Example 3: The sentence that the above regex pattern is partially working (as you can see, instead of 2 <br>, there are 4 <br>)

*Original content in EML file:*    

[http://www.concordpri.moe.edu.sg/]

 

 

 

 

On Tue, Dec 18, 2018 at 10:07 AM    

*Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM 

*Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On Tue, Dec 18, 2018 at 10:07 AM


> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.6
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>


