Posted to dev@lucene.apache.org by "Edwin Yeo Zheng Lin (JIRA)" <ji...@apache.org> on 2019/02/13 08:48:00 UTC

[jira] [Comment Edited] (SOLR-13242) RegexReplaceProcessorFactory not making accurate replacement

    [ https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766679#comment-16766679 ] 

Edwin Yeo Zheng Lin edited comment on SOLR-13242 at 2/13/19 8:47 AM:
---------------------------------------------------------------------

So far we have only tested the pattern in two places: the indexing in Solr and the regex101.com online tester. We have not yet tried running it in a separate Java program.

We tried both "(\s*\n)\{2,}" and "(\n\s*)\{2,}". Both patterns give the same results.

If the content includes newlines, then the pattern should have done what we expected, shouldn't it? The original content does have newlines.
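
A standalone check outside Solr could look like the sketch below. It is only a minimal sketch under stated assumptions: the "\{" in the patterns above is read as JIRA escaping for a plain "{", the input is Example 2 from the issue description typed with ordinary spaces and real newline characters, and the class and method names (RegexCheck, collapse) are made up for illustration. It uses java.util.regex directly and does not go through RegexReplaceProcessorFactory.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // The two patterns from the comment, with the JIRA-escaped brace
    // read as an ordinary quantifier brace (assumption).
    static final Pattern WS_THEN_NL = Pattern.compile("(\\s*\\n){2,}");
    static final Pattern NL_THEN_WS = Pattern.compile("(\\n\\s*){2,}");

    // Replace every match of the pattern with two <br> tags.
    static String collapse(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from the issue, assuming plain spaces and real newlines.
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  "
                + "3 Choa Chu Kang Avenue 4, Singapore";

        System.out.println(collapse(WS_THEN_NL, example2));
        System.out.println(collapse(NL_THEN_WS, example2));
    }
}
```

With this input, both patterns produce a single <br><br> per whitespace run; they differ only in whether the spaces before or after the run are left in place. If the Solr output still shows four <br> for the same document, the indexed field value probably differs from the string typed here.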


was (Author: edwinyeozl):
So far we have not tried running in a different Java program other than the indexing in Solr and the regex101.com online emulator.

We tried both "(\s*\n)\{2,}" and "(\n*\s)\{2,}". Both patterns give the same results.

If it includes newlines then it should have done what we expected? The original content does have newlines.

> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.6
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>
> We are using the RegexReplaceProcessorFactory with the following configuration:
>  
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\n)\{2,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>  </processor>
>  
> The regex pattern (\s*\n)\{2,} works perfectly in [regex101.com|http://regex101.com/], where each run of \n is replaced by exactly two <br>.
> However, in Solr, there are cases (Examples 2 and 3 below) that have four <br> in a row. This should not happen, as the pattern is meant to replace each run with two <br> regardless of how many \n there are in a row.
>  
>  
> Example 1: A sentence where the above regex pattern works correctly 
> *Original content in EML file:*  
> Dear Sir, 
>  
> I am terminating 
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content:*     Dear Sir,  <br><br>I am terminating 
>  
> Example 2: A sentence where the above regex pattern only partially works (as you can see, there are 4 <br> instead of 2)
> *Original content in EML file:*    
> _exalted_
> _Psalm 89:17_
>  
> 3 Choa Chu Kang Avenue 4    
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
> *Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore
>  
> Example 3: A sentence where the above regex pattern only partially works (as you can see, there are 4 <br> instead of 2)
> *Original content in EML file:*    
> [http://www.concordpri.moe.edu.sg/]
>  
>  
>  
>  
> On Tue, Dec 18, 2018 at 10:07 AM    
> *Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM 
> *Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On Tue, Dec 18, 2018 at 10:07 AM
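
One possible way to narrow this down is to look at which whitespace characters the field value actually contains, because Java's default \s matches only space, tab, newline, vertical tab, form feed and carriage return. The sketch below is not a diagnosis of this issue. It is a hypothetical helper (the names WhitespaceDump and describe are made up) that lists each whitespace-like code point in a string and whether the default \s would match it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WhitespaceDump {
    // Default Java whitespace class, without UNICODE_CHARACTER_CLASS.
    static final Pattern JAVA_WS = Pattern.compile("\\s");

    // One entry per whitespace-like code point, in order of appearance.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints()
            .filter(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp))
            .forEach(cp -> {
                String ch = new String(Character.toChars(cp));
                boolean matched = JAVA_WS.matcher(ch).matches();
                out.add(String.format("U+%04X matchedByBackslashS=%b", cp, matched));
            });
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample: a space, a no-break space and a newline.
        describe("a \u00A0\nb").forEach(System.out::println);
    }
}
```

Running this over the stored field value (for example, one copied from a query response) would show whether anything other than plain spaces and newlines sits between the two <br><br> groups.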



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
