Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2019/02/07 13:08:15 UTC

RegexReplaceProcessorFactory pattern to detect multiple \n

Hi,

I am trying to use the RegexReplaceProcessorFactory to match two or more
\n with any number of spaces between them (e.g. \n\n, \n \n, \n \n  \n \n),
and replace them with two <br>.

I use the following regex pattern and it is working when I test it in
regex101.com. But it is not working when I put it inside the
RegexReplaceProcessorFactory as below:

<updateRequestProcessorChain name="removeCode">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">"(\\n\s*){2,}"</str>
    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  </processor>
</updateRequestProcessorChain>

To explain my regex pattern further: \s* tells the regex to match any
whitespace that follows a \n, and {2,} tells it to match two or more
occurrences of that group (\n plus its trailing whitespace).

Please let me know what is wrong and how I should do it.

I am using Solr 7.6.0.

Regards,
Edwin
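
One thing that can be checked outside Solr is what the pattern text in the config above means to java.util.regex. The sketch below assumes that Solr hands the text of the <str name="pattern"> element to Pattern.compile unchanged (the class and method names are made up for the illustration). Under that assumption the surrounding double quotes are matched literally and \\n means a backslash followed by the letter n, so real line breaks are never matched.

```java
import java.util.regex.Pattern;

class QuotedPatternCheck {
    // Element text exactly as posted: "(\\n\s*){2,}" including the double quotes.
    static final String AS_POSTED = "\"(\\\\n\\s*){2,}\"";
    // Same pattern without the quotes and with a single backslash: (\n\s*){2,}
    static final String UNQUOTED = "(\\n\\s*){2,}";

    // Replace every match of regex in input with two <br>.
    static String replace(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String input = "Dear Sir,\n \n\nI am terminating";
        // The posted form needs a literal quote character, so nothing is replaced.
        System.out.println(replace(AS_POSTED, input).equals(input)); // prints true
        // The unquoted form collapses the run of line breaks.
        System.out.println(replace(UNQUOTED, input)); // prints Dear Sir,<br><br>I am terminating
    }
}
```

This only covers the quoting. It says nothing about why later patterns in the thread still leave extra <br>.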

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Jörn Franke <jo...@gmail.com>.
Maybe they work properly and the regex is not as expected? 
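
A way to check this without an online tool is to run the pattern through java.util.regex directly. The sketch below is only an illustration: the class name is invented, and the input is Example 2 from the thread retyped on the assumption that it contains nothing but ordinary spaces (U+0020) and \n.

```java
import java.util.regex.Pattern;

class LocalRegexCheck {
    // Collapse each run of two or more line breaks (with optional whitespace) into two <br>.
    static String collapse(String input) {
        return Pattern.compile("(\\n\\s*){2,}").matcher(input).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2, assuming only plain spaces and \n between the text parts.
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
        System.out.println(collapse(example2));
        // prints exalted  <br><br>Psalm 89:17   <br><br>3 Choa Chu Kang Avenue 4
    }
}
```

With plain spaces each run becomes a single <br><br>. If the indexed content still shows "<br><br>  <br><br>", that would suggest the real data contains a character between the line breaks that \s does not match.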

> On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
> 
> Hi,
> 
> Thanks for the reply.
> 
> Do you know of any regex online tool that works correctly for Java regex?
> I tried to find some, but they are not working properly.
> 
> Yes, our plan is to replace more than one \n with <br><br>, and single \n
> with single <br>.
> 
> Regards,
> Edwin
> 
>> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com> wrote:
>> 
>> Solr uses Java regex matching, so I doubt there is a bug; if there were, it
>> would be in the JDK. Try your solution out in an online regex tool that
>> supports Java regex.
>> 
>> I believe you want two regex processor factories:
>> one that deals with a single \n and one that deals with more than one \n
>> 
>>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> configuration:
>>> 
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>  <str name="fieldName">content</str>
>>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>  <bool name="literalReplacement">true</bool>
>>> </processor>
>>> 
>>> However, the issue is still occurring.
>>> 
>>> Is anyone else able to help?
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <ed...@gmail.com>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Should we report this as a bug in Solr?
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>>> 
>>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>> 
>>>>>> Hi Paul,
>>>>>> 
>>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>>>>> https://regex101.com/, it gives us the correct result for all
>>>>>> the examples (i.e. all of them have only <br><br>, and not more than
>>>>>> that, unlike what we are getting in Solr in our earlier examples).
>>>>>> 
>>>>>> Could there be a possibility of a bug in Solr?
>>>>>> 
>>>>>> Regards,
>>>>>> Edwin
>>>>>> 
>>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Paul,
>>>>>>> 
>>>>>>> We have tried it with the space preceding the \n, i.e. <str
>>>>>>> name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>>>>> 
>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>  <str name="fieldName">content</str>
>>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>> </processor>
>>>>>>> 
>>>>>>> However, we are getting exactly the same results as in the earlier
>>>>>>> Examples 1, 2 and 3.
>>>>>>> 
>>>>>>> As for your point 2, that the data may contain other (non-printing)
>>>>>>> characters than \n, we have found that there are no non-printing
>>>>>>> characters. It is just a new line with a space. You can refer to the
>>>>>>> original content in the same examples below.
>>>>>>> 
>>>>>>> 
>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>> correctly
>>>>>>> *Original content in EML file:*
>>>>>>> Dear Sir,
>>>>>>> 
>>>>>>> 
>>>>>>> I am terminating
>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>> 
>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content in EML file:*
>>>>>>> 
>>>>>>> *exalted*
>>>>>>> 
>>>>>>> *Psalm 89:17*
>>>>>>> 
>>>>>>> 
>>>>>>> 3 Choa Chu Kang Avenue 4
>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>>>> 
>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content in EML file:*
>>>>>>> 
>>>>>>> http://www.concordpri.moe.edu.sg/
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
>>>>>>> Dec 18, 2018 at 10:07 AM
>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> 
>>>>>>> 
>>>>>>> Appreciate any other ideas or suggestions that you may have.
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>> 
>>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>>>>>>>> 
>>>>>>>> Hi Edwin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>>>> 2.  Perhaps the data contains other (non-printing) characters
>>>>>>>> than \n?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Paul,
>>>>>>>> 
>>>>>>>> We have tried this suggested regex pattern as follow:
>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>  <str name="fieldName">content</str>
>>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>> </processor>
>>>>>>>> 
>>>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
>>>>>>>> 
>>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>>> correctly
>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>>> 
>>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>>> working
>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>>>> Choa
>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>>>>> Choa
>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>> 
>>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>>> working
>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>>>>> \n \n\n
>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
>>>>>>>> 18, 2018 at 10:07 AM
>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>>> <br><br>On
>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>> 
>>>>>>>> Any further suggestion?
>>>>>>>> 
>>>>>>>> Thank you.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Edwin
>>>>>>>> 
>>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>>>>>>>>> 
>>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
>>>>>>>>> part, you could try
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If you also want to match CRLF then
>>>>>>>>> 
>>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Paul,
>>>>>>>>> 
>>>>>>>>> Thanks for your reply.
>>>>>>>>> 
>>>>>>>>> When I use this pattern:
>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>  <str name="fieldName">content</str>
>>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>> </processor>
>>>>>>>>> 
>>>>>>>>> It works for some sentences within the same content and not for
>>>>>>>>> others. Please see below for one that works and another
>>>>>>>>> that does not (it works only partially):
>>>>>>>>> 
>>>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>>>> correctly
>>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>>>> 
>>>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>>>> working
>>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>>>> Choa
>>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>>>>> Choa
>>>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>>>> 
>>>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>>>> working
>>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>> \n\n
>>>>>>>> \n
>>>>>>>>> \n\n
>>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
>>>>>>>> 18, 2018
>>>>>>>>> at 10:07 AM
>>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>>>> <br><br>On
>>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>>>>> 
>>>>>>>>> We would appreciate your help in seeing what is wrong.
>>>>>>>>> 
>>>>>>>>> Thank you.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Edwin
>>>>>>>>> 
>>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>>>>>>>>>> 
>>>>>>>>>> You don’t say what happens, just that it is not working. I assume nothing
>>>>>>>>>> is replaced? Perhaps the pattern should be
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ??
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove more
>>>>>>>> than
>>>>>>>>> two
>>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n
>>>>>>>> \n
>>>>>>>>> \n),
>>>>>>>>>> and replace it with two <br>.
>>>>>>>>>> 
>>>>>>>>>> I use the following regex pattern and it is working when I test it
>>>>>>>> in
>>>>>>>>>> regex101.com. But it is not working when I put it inside the
>>>>>>>>>> RegexReplaceProcessorFactory as below:
>>>>>>>>>> 
>>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>>>  <str name="fieldName">content</str>
>>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>>>> </processor>
>>>>>>>>>>         </updateRequestProcessorChain>
>>>>>>>>>> 
>>>>>>>>>> To explain further about my regex pattern, \s* is instructing the
>>>>>>>> regex
>>>>>>>>> to
>>>>>>>>>> match any \n that have space after and {2,} is instructing the
>>>>>>>> regex to
>>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>>>>>>>>> 
>>>>>>>>>> Please kindly let me know what is wrong and how should I do it?
>>>>>>>>>> 
>>>>>>>>>> I am using Solr 7.6.0.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Edwin
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>> 

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

I would like to check whether there is any difference in performance between
the following two patterns:

<str name="pattern">(\n\W*){2,}</str>

<str name="pattern">[ \t\x0b\f]*\r?\n</str>

Regards,
Edwin
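
Before comparing speed, it may help to see that the character classes involved match different things. In Java, \s without flags is [ \t\n\x0B\f\r], while \W is anything outside [a-zA-Z_0-9]. The sketch below is an illustration only: the class name is invented, and the non-breaking space (U+00A0) in the input is an assumption about what the mail content might contain, which the thread does not confirm.

```java
import java.util.regex.Pattern;

class WhitespaceClassCheck {
    // Hypothetical culprit: a non-breaking space, which Java's default \s does not match.
    static final String NBSP = "\u00A0";

    // Replace every match of regex in input with two <br>.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String input = "Psalm 89:17\n\n" + NBSP + "\n\n3 Choa Chu Kang";

        // \s stops at the NBSP, so the run is split and two pairs of <br> remain.
        System.out.println(collapse("(\\n\\s*){2,}", input).contains("<br><br>" + NBSP + "<br><br>")); // prints true

        // \W treats the NBSP as a non-word character and swallows the whole run.
        System.out.println(collapse("(\\n\\W*){2,}", input)); // prints Psalm 89:17<br><br>3 Choa Chu Kang

        // (?U) turns on UNICODE_CHARACTER_CLASS, so \s also covers the NBSP.
        System.out.println(collapse("(?U)(\\n\\s*){2,}", input)); // prints Psalm 89:17<br><br>3 Choa Chu Kang

        // \W also removes punctuation that follows the line breaks.
        System.out.println(collapse("(\\n\\W*){2,}", "a\n\n-- b")); // prints a<br><br>b
    }
}
```

The second pattern in the question, [ \t\x0b\f]*\r?\n, matches one line ending at a time and does not collapse runs by itself, so the two patterns are not interchangeable and a pure speed comparison may not be the deciding factor.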

On Thu, 14 Mar 2019 at 09:36, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> Thanks for your reply.
>
> So far we have not found any cases of punctuation being removed.
>
> Our aim is to replace each run of line breaks (\n) with 2 <br>, and they are
> not likely to have any punctuation in between.
>
> Do you know if this pattern <str name="pattern">(\n\W*){2,}</str> that
> we are using is OK?
> Or would the other pattern, <str name="pattern">[ \t\x0b\f]*\r?\n</str>, be better?
>
> Regards,
> Edwin
>
> On Wed, 13 Mar 2019 at 20:08, <pa...@ub.unibe.ch> wrote:
>
>> Hi Edwin,
>> With \W you will also replace non-word characters such as punctuation. If
>> that's OK, fine. Otherwise you need to identify the whitespace characters
>> that are causing the problem.
>> ________________________________
>> From: Zheng Lin Edwin Yeo <ed...@gmail.com>
>> Sent: Wednesday, 13 March 2019 03:25:39
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi,
>>
>> We have managed to resolve the issue by changing the \s to \W. The reason
>> could be that some of the gaps contain other whitespace characters rather
>> than just a plain space. In our data \s removes only the plain spaces and not
>> those other characters, but \W removes them as well.
>>
>> We have used this config, and it works.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\W*){2,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\W*){1,}</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Has anyone else faced the same issue before?
>> > So far none of the regex patterns that we tried in this thread has been
>> > able to resolve the issue.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> > wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> Sorry, I realized there is an extra ']' in the pattern provided, which is
>> >> why there are so many <br> in the output.
>> >>
>> >> The output is exactly the same as previously (previous index result) if
>> >> we remove the extra ']', as shown in the configuration below.
>> >>
>> >>  <processor class="solr.RegexReplaceProcessorFactory">
>> >>    <str name="fieldName">content</str>
>> >>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>    <str name="replacement">&lt;br&gt;</str>
>> >>    <bool name="literalReplacement">true</bool>
>> >>  </processor>
>> >>  <processor class="solr.RegexReplaceProcessorFactory">
>> >>    <str name="fieldName">content</str>
>> >>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
>> >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>    <bool name="literalReplacement">true</bool>
>> >>  </processor>
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >>
>> >> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>
>> >>> Hi Paul,
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> For the 2nd pattern, if we put this pattern <str
>> >>> name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, which is like the
>> >>> configurations below:
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>    <str name="fieldName">content</str>
>> >>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>    <str name="replacement">&lt;br&gt;</str>
>> >>>    <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>    <str name="fieldName">content</str>
>> >>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>> >>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>    <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> It is not able to reduce the runs of 3 or more <br> to 2 <br>.
>> >>>
>> >>> We will end up with many <br> in the output, like the example below:
>> >>>
>> >>>  http://www.concorded.com/<br><br>
>> <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
>> On Tue, Dec 18, 2018
>> >>>
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:
>> >>>
>> >>>> Hi Edwin
>> >>>>
>> >>>>
>> >>>>
>> >>>> I can’t understand why the pattern is not working and where the spaces
>> >>>> between the <br> are coming from. It should, however, be possible to
>> >>>> allow for spaces between the <br> in the second match pattern, i.e. 2nd pattern
>> >>>>
>> >>>>
>> >>>>
>> >>>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>> >>>>
>> >>>>
>> >>>>
>> >>>> /Paul
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>>> Sent: Wednesday, 6 March 2019 16:28
>> >>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> >>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>>
>> >>>>
>> >>>>
>> >>>> Hi Paul,
>> >>>>
>> >>>> I have tried setting the first match pattern to
>> >>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:
>> >>>>
>> >>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>>    <str name="fieldName">content</str>
>> >>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>>    <str name="replacement">&lt;br&gt;</str>
>> >>>>    <bool name="literalReplacement">true</bool>
>> >>>> </processor>
>> >>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>>    <str name="fieldName">content</str>
>> >>>>    <str name="pattern">(&lt;br&gt;){3,}</str>
>> >>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>>    <bool name="literalReplacement">true</bool>
>> >>>> </processor>
>> >>>>
>> >>>> However, the result is still the same as before (previous index
>> >>>> results),
>> >>>> with the 4 <br>.
>> >>>>
>> >>>> Regards,
>> >>>> Edwin
>> >>>>
>> >>>>
>> >>>> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
>> >>>>
>> >>>> > Hi Edwin
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > You are correct re the 2nd pattern, my bad. Looking at the 4 <br>, it’s
>> >>>> > actually the sequence «<br><br>  <br><br>»? So perhaps the first match
>> >>>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > i.e. [space tab vertical-tab formfeed]
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > Regards,
>> >>>> >
>> >>>> > Paul
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>>> > Sent: Wednesday, 6 March 2019 07:44
>> >>>> > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > Hi Paul,
>> >>>> >
>> >>>> > I have modified the second pattern to be (&lt;br&gt;){3,} instead of
>> >>>> > (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
>> >>>> > actually looks for 6 or more <br> instead of 3 <br>, as we have put
>> >>>> > the <br> twice in the pattern. That is why there were more
>> >>>> > <br> in the result: cases with fewer than 6 <br> were not
>> >>>> > replaced, so we ended up with up to 5 <br> in the index.
>> >>>> >
>> >>>> > Modified configuration:
>> >>>> >  <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> >    <str name="fieldName">content</str>
>> >>>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
>> >>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> >    <bool name="literalReplacement">true</bool>
>> >>>> >  </processor>
>> >>>> >
>> >>>> > This brings us back to the result of the previous index content,
>> >>>> > meaning the issue of having the 4 <br> is still there.
>> >>>> >
>> >>>> > Regards,
>> >>>> > Edwin
>> >>>> >
>> >>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>>> >
>> >>>> > > Hi Paul,
>> >>>> > >
>> >>>> > > Further to my previous email, in which there was an extra "}" in the
>> >>>> > > configuration, I have changed to the configuration below, based on
>> >>>> > > your suggestion.
>> >>>> > >
>> >>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >    <str name="fieldName">content</str>
>> >>>> > >    <str name="pattern">[ \t]*\r?\n</str>
>> >>>> > >    <str name="replacement">&lt;br&gt;</str>
>> >>>> > >    <bool name="literalReplacement">true</bool>
>> >>>> > > </processor>
>> >>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >    <str name="fieldName">content</str>
>> >>>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> >>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >    <bool name="literalReplacement">true</bool>
>> >>>> > > </processor>
>> >>>> > >
>> >>>> > > However, the result that I get still has more than 2 <br>. In fact, the
>> >>>> > > result has become worse, as you can see from the comparison below.
>> >>>> > >
>> >>>> > > Example 1: A sentence on which the regex pattern used to work correctly.
>> >>>> > > With the latest pattern, it has changed from 2 <br> to 5 <br>,
>> >>>> > > which is wrong.
>> >>>> > > *Original content in EML file:*
>> >>>> > > Dear Sir,
>> >>>> > >
>> >>>> > >
>> >>>> > > I am terminating
>> >>>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> >>>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
>> >>>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>> >>>> > >
>> >>>> > > Example 2: The sentence that the above regex pattern is partially
>> >>>> working
>> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>>> > > *Original content in EML file:*
>> >>>> > >
>> >>>> > > *exalted*
>> >>>> > >
>> >>>> > > *Psalm 89:17*
>> >>>> > >
>> >>>> > >
>> >>>> > > 3 Choa Chu Kang Avenue 4
>> >>>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> >>>> > > Chu Kang Avenue 4, Singapore
>> >>>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> >>>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br> <br><br>  3
>> >>>> > > Choa Chu Kang Avenue 4, Singapore
>> >>>> > > Example 3: A sentence on which the above regex pattern works only partially
>> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>). With the latest
>> >>>> > > configuration, there are now 5 <br>.
>> >>>> > > *Original content in EML file:*
>> >>>> > >
>> >>>> > > http://www.concorded.com/
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > >
>> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n
>> >>>> \n\n \n\n
>> >>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>> >>>> 2018 at
>> >>>> > > 10:07 AM
>> >>>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
>> >>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > > *Current Index content:* http://www.concorded.com/<br><br>
>> >>>> <br><br><br>
>> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >
>> >>>> > >
>> >>>> > > Regards,
>> >>>> > > Edwin
>> >>>> > >
>> >>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>>> > >
>> >>>> > >> Hi Paul,
>> >>>> > >>
>> >>>> > >> Thank you for the reply.
>> >>>> > >>
>> >>>> > >> I have tried to add the following configuration according to
>> your
>> >>>> > >> suggestion:
>> >>>> > >>
>> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>    <str name="fieldName">content</str>
>> >>>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
>> >>>> > >>    <str name="replacement">&lt;br&gt;</str>
>> >>>> > >>    <bool name="literalReplacement">true</bool>
>> >>>> > >> </processor>
>> >>>> > >>
>> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>    <str name="fieldName">content</str>
>> >>>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> >>>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>    <bool name="literalReplacement">true</bool>
>> >>>> > >> </processor>
>> >>>> > >>
>> >>>> > >> However, none of the \n is being removed this time round.
>> >>>> > >> Is the order and/or the pattern correct?
>> >>>> > >>
>> >>>> > >> Regards,
>> >>>> > >> Edwin
>> >>>> > >>
>> >>>> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>> >>>> > >>
>> >>>> > >>> Hi Edwin
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Try for the first pattern/replacement
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> <str name="pattern">[ \t]*\r?\n</str>
>> >>>> > >>>
>> >>>> > >>> <str name="replacement">&lt;br&gt;</str>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Now all line endings and preceding whitespace characters should be
>> >>>> > >>> changed to ‘<br>’.
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> The second pattern replacement should replace 3 or more ‘<br>’
>> >>>> > >>> sequences with 2 ‘<br>’ sequences:
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> >>>> > >>>
>> >>>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Hope this approach works. Sorry for not replying earlier and
>> best
>> >>>> > >>> regards,
>> >>>> > >>>
>> >>>> > >>> Paul
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>>> > >>> Sent: Tuesday, 5 March 2019 03:35
>> >>>> > >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> Hi,
>> >>>> > >>>
>> >>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>> >>>> > >>>
>> >>>> > >>> Regards,
>> >>>> > >>> Edwin
>> >>>> > >>>
>> >>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>>> > >>>
>> >>>> > >>> > Hi,
>> >>>> > >>> >
>> >>>> > >>> > Does anyone else have other suggestions, or has anyone faced the same problem?
>> >>>> > >>> >
>> >>>> > >>> > Regards,
>> >>>> > >>> > Edwin
>> >>>> > >>> >
>> >>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>>> > >>> >
>> >>>> > >>> >> Hi Paul,
>> >>>> > >>> >>
>> >>>> > >>> >> If I execute the second step first, then I only get a
>> >>>> > >>> >> single <br> for those with 2 <br>.
>> >>>> > >>> >> For those where we originally got 4 <br>, there are 2 <br> with a
>> >>>> > >>> >> space in between.
>> >>>> > >>> >>
>> >>>> > >>> >> This just changes the 2 <br> to a single <br>, since the second
>> >>>> > >>> >> step replaces with a single <br>.
>> >>>> > >>> >> But it has not solved the underlying problem yet.
>> >>>> > >>> >>
>> >>>> > >>> >> Regards,
>> >>>> > >>> >> Edwin
>> >>>> > >>> >>
>> >>>> > >>> >>
>> >>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>> >>>> > >>> >>
>> >>>> > >>> >>> If the second step is executed first, then you will get the unwanted
>> >>>> > >>> >>> 4 <br>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
>> >>>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>>
>> >>>> > >>> >>> Hi Jörn ,
>> >>>> > >>> >>>
>> >>>> > >>> >>> Do you mean the regex is not correct?
>> >>>> > >>> >>>
>> >>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps, like
>> >>>> > >>> >>> the ones shown below. The output that we get is still the same.
>> >>>> > >>> >>>
>> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>>      <str name="fieldName">content</str>
>> >>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>> >>>> > >>> >>> </processor>
>> >>>> > >>> >>>
>> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>>      <str name="fieldName">content</str>
>> >>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>> >>>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
>> >>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>> >>>> > >>> >>> </processor>
>> >>>> > >>> >>>
>> >>>> > >>> >>> Regards,
>> >>>> > >>> >>> Edwin
>> >>>> > >>> >>>
>> >>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <
>> >>>> jornfranke@gmail.com>
>> >>>> > >>> wrote:
>> >>>> > >>> >>>
>> >>>> > >>> >>> > Then you need two regexprocessfactory steps
>> >>>> > >>> >>> >
>> >>>> > >>> >>> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo <
>> >>>> > >>> >>> edwinyeozl@gmail.com
>> >>>> > >>> >>> > >:
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Hi,
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Thanks for the reply.
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Do you know of any regex online tool that works
>> correctly
>> >>>> for
>> >>>> > >>> Java
>> >>>> > >>> >>> regex?
>> >>>> > >>> >>> > > I tried to find some, but they are not working
>> properly.
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Yes, our plan is to replace more than one \n with
>> >>>> <br><br>, and
>> >>>> > >>> >>> single \n
>> >>>> > >>> >>> > > with single <br>.
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > > Regards,
>> >>>> > >>> >>> > > Edwin
>> >>>> > >>> >>> > >
>> >>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <
>> >>>> > jornfranke@gmail.com
>> >>>> > >>> >
>> >>>> > >>> >>> wrote:
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> > >> Solr uses Java regex matching, so i doubt there is a
>> bug
>> >>>> - it
>> >>>> > >>> would
>> >>>> > >>> >>> then
>> >>>> > >>> >>> > >> be in the JDK. Try out in a regex online Tool that
>> >>>> supports
>> >>>> > Java
>> >>>> > >>> >>> regex
>> >>>> > >>> >>> > for
>> >>>> > >>> >>> > >> your solution.
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> > >> I believe you want to have 2 regex process factories:
>> >>>> > >>> >>> > >> One that deals with single \n and one that deals with
>> >>>> more
>> >>>> > than
>> >>>> > >>> one
>> >>>> > >>> >>> \n
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <
>> >>>> > >>> >>> > edwinyeozl@gmail.com
>> >>>> > >>> >>> > >>> :
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> Hi,
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> We have tried with the following pattern ([
>> >>>> \t]*\r?\n){2,}
>> >>>> > and
>> >>>> > >>> >>> > >>> configuration:
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
>> >>>> > >>> >>> > >>> </processor>
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> However, the issue is still occurring.
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> Is anyone else able to help?
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> Regards,
>> >>>> > >>> >>> > >>> Edwin
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>> >>>> > >>> >>> > edwinyeozl@gmail.com>
>> >>>> > >>> >>> > >>> wrote:
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>>> Hi,
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr
>> 7.7.0 as
>> >>>> > well.
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>> Regards,
>> >>>> > >>> >>> > >>>> Edwin
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>> >>>> > >>> >>> > edwinyeozl@gmail.com
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>>> wrote:
>> >>>> > >>> >>> > >>>>
>> >>>> > >>> >>> > >>>>> Hi,
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>> Regards,
>> >>>> > >>> >>> > >>>>> Edwin
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>> >>>> > >>> >>> > edwinyeozl@gmail.com
>> >>>> > >>> >>> > >>>
>> >>>> > >>> >>> > >>>>> wrote:
>> >>>> > >>> >>> > >>>>>
>> >>>> > >>> >>> > >>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using,
>> >>>> when we
>> >>>> > >>> try
>> >>>> > >>> >>> it on
>> >>>> > >>> >>> > >>>>>> https://regex101.com/, it is able to give us the
>> >>>> correct
>> >>>> > >>> >>> result for
>> >>>> > >>> >>> > >> all
>> >>>> > >>> >>> > >>>>>> the examples (ie: All of them will only have
>> >>>> <br><br>, and
>> >>>> > >>> not
>> >>>> > >>> >>> more
>> >>>> > >>> >>> > >> than
>> >>>> > >>> >>> > >>>>>> that like what we are getting in Solr in our
>> earlier
>> >>>> > >>> examples).
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> Regards,
>> >>>> > >>> >>> > >>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>> >>>> > >>> >>> > >> edwinyeozl@gmail.com>
>> >>>> > >>> >>> > >>>>>> wrote:
>> >>>> > >>> >>> > >>>>>>
>> >>>> > >>> >>> > >>>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n
>> >>>> i.e.
>> >>>> > <str
>> >>>> > >>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the
>> following
>> >>>> > regex
>> >>>> > >>> >>> pattern:
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> <processor
>> >>>> class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> >>>> > >>> >>> > >>>>>>>  <str
>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> However, we are also getting the exact same
>> results
>> >>>> as
>> >>>> > the
>> >>>> > >>> >>> earlier
>> >>>> > >>> >>> > >>>>>>> Example 1, 2 and 3.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> As for your point 2 on perhaps in the data you
>> have
>> >>>> other
>> >>>> > >>> (non
>> >>>> > >>> >>> > >>>>>>> printing) characters than \n, we have found that
>> >>>> there are
>> >>>> > >>> no
>> >>>> > >>> >>> non
>> >>>> > >>> >>> > >> printing
>> >>>> > >>> >>> > >>>>>>> characters. It is just next line with a space.
>> You
>> >>>> can
>> >>>> > >>> refer
>> >>>> > >>> >>> to the
>> >>>> > >>> >>> > >>>>>>> original content in the same examples below.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> working
>> >>>> > >>> >>> > >>>>>>> correctly
>> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> >>>> > >>> >>> > >>>>>>> Dear Sir,
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> I am terminating
>> >>>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I
>> am
>> >>>> > >>> terminating
>> >>>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> >>>> terminating
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>> >>>> are 4
>> >>>> > >>> <br>)
>> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> *exalted*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> *Psalm 89:17*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> >>>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm
>> 89:17
>> >>>>  \n\n
>> >>>> > >>> >>>  \n\n  3
>> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>> >>>>  <br><br>
>> >>>> > >>> >>> <br><br>3
>> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>> >>>> are 4
>> >>>> > >>> <br>)
>> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>> *Original content:*
>> >>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  \n\n
>> >>>> > >>> >>> >  \n\n
>> >>>> > >>> >>> > >> \n
>> >>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n
>> \n\n\n
>> >>>> > >>> \n\n\n  On
>> >>>> > >>> >>> Tue,
>> >>>> > >>> >>> > >> Dec 18,
>> >>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>> *Index content: *
>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  <br><br>
>> >>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that
>> you
>> >>>> may
>> >>>> > >>> have.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Thank you.
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <
>> >>>> paul.dodd@ub.unibe.ch>
>> >>>> > >>> wrote:
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Hi Edwin
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space
>> should
>> >>>> > preceed
>> >>>> > >>> >>> the \n
>> >>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> >>>> > >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non
>> >>>> printing)
>> >>>> > >>> >>> characters
>> >>>> > >>> >>> > >>>>>>>> than \n?
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>>> > >>> >>> > >>>>>>>> Windows 10
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>>> > >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> >>>> > >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as
>> >>>> follows:
>> >>>> > >>> >>> > >>>>>>>> <processor
>> >>>> class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>  <str
>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of
>> >>>> Example
>> >>>> > 1,2
>> >>>> > >>> and
>> >>>> > >>> >>> 3
>> >>>> > >>> >>> > >> below.
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> working
>> >>>> > >>> >>> > >>>>>>>> correctly
>> >>>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n
>> I am
>> >>>> > >>> >>> terminating
>> >>>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> >>>> terminating
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>>> working
>> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> >>>> <br>)
>> >>>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm
>> 89:17
>> >>>> >  \n\n
>> >>>> > >>> >>>  \n\n
>> >>>> > >>> >>> > 3
>> >>>> > >>> >>> > >>>>>>>> Choa
>> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>> >>>>  <br><br>
>> >>>> > >>> >>> > <br><br>3
>> >>>> > >>> >>> > >>>>>>>> Choa
>> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>>> working
>> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> >>>> <br>)
>> >>>> > >>> >>> > >>>>>>>> *Original content:*
>> >>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  \n\n
>> >>>> > >>> >>> >  \n\n
>> >>>> > >>> >>> > >>>>>>>> \n \n\n
>> >>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>> >>>> \n\n\n
>> >>>> > On
>> >>>> > >>> >>> Tue, Dec
>> >>>> > >>> >>> > >> 18,
>> >>>> > >>> >>> > >>>>>>>> 2018
>> >>>> > >>> >>> > >>>>>>>> at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>> *Index content: *
>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  <br><br>
>> >>>> > >>> >>> > >>>>>>>> <br><br>On
>> >>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Any further suggestion?
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Thank you.
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <
>> >>>> paul.dodd@ub.unibe.ch>
>> >>>> > >>> wrote:
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and
>> >>>> then
>> >>>> > >>> failing
>> >>>> > >>> >>> on
>> >>>> > >>> >>> > the
>> >>>> > >>> >>> > >>>>>>>> {2,}
>> >>>> > >>> >>> > >>>>>>>>> part you could try
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>>> > >>> >>> > >>>>>>>>> Windows 10
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>>> > >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> >>>> > >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Hi Paul,
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Thanks for your reply.
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> When I use this pattern:
>> >>>> > >>> >>> > >>>>>>>>> <processor
>> >>>> class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> >>>> > >>> >>> > >>>>>>>>>  <str
>> >>>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> It is working for some sentence within the same
>> >>>> content
>> >>>> > >>> and
>> >>>> > >>> >>> not
>> >>>> > >>> >>> > >>>>>>>> working for
>> >>>> > >>> >>> > >>>>>>>>> some sentences. Please see below for the one
>> that
>> >>>> is
>> >>>> > >>> working
>> >>>> > >>> >>> and
>> >>>> > >>> >>> > >>>>>>>> another
>> >>>> > >>> >>> > >>>>>>>>> that is not working (partially working):
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> working
>> >>>> > >>> >>> > >>>>>>>> correctly
>> >>>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n
>> I
>> >>>> am
>> >>>> > >>> >>> terminating
>> >>>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> >>>> > terminating
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>>> working
>> >>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> >>>> <br>)
>> >>>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm
>> 89:17
>> >>>> >  \n\n
>> >>>> > >>> >>> >  \n\n  3
>> >>>> > >>> >>> > >>>>>>>> Choa
>> >>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>> >>>> >  <br><br>
>> >>>> > >>> >>> > <br><br>3
>> >>>> > >>> >>> > >>>>>>>> Choa
>> >>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex
>> >>>> pattern is
>> >>>> > >>> >>> partially
>> >>>> > >>> >>> > >>>>>>>> working
>> >>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> >>>> <br>)
>> >>>> > >>> >>> > >>>>>>>>> *Original content:*
>> >>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>>  \n\n
>> >>>> > >>> >>> > >> \n\n
>> >>>> > >>> >>> > >>>>>>>> \n
>> >>>> > >>> >>> > >>>>>>>>> \n\n
>> >>>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>> >>>> \n\n\n
>> >>>> > On
>> >>>> > >>> >>> Tue,
>> >>>> > >>> >>> > Dec
>> >>>> > >>> >>> > >>>>>>>> 18, 2018
>> >>>> > >>> >>> > >>>>>>>>> at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>>> *Index content: *
>> >>>> http://www.concordpri.moe.edu.sg/
>> >>>> > >>> >>>  <br><br>
>> >>>> > >>> >>> > >>>>>>>> <br><br>On
>> >>>> > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> We would appreciate your help to see what is
>> >>>> wrong?
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Thank you.
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <
>> >>>> paul.dodd@ub.unibe.ch>
>> >>>> > >>> wrote:
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is
>> not
>> >>>> > >>> working. I
>> >>>> > >>> >>> > assume
>> >>>> > >>> >>> > >>>>>>>> nothing
>> >>>> > >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> ??
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>>> > >>> >>> > >>>>>>>>>> Windows 10
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> >>>> > >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> >>>> > >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>>> > >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Hi,
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> I am trying to use the
>> >>>> RegexReplaceProcessorFactory to
>> >>>> > >>> >>> remove
>> >>>> > >>> >>> > more
>> >>>> > >>> >>> > >>>>>>>> than
>> >>>> > >>> >>> > >>>>>>>>> two
>> >>>> > >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg:
>> >>>> \n\n,
>> >>>> > \n
>> >>>> > >>> \n,
>> >>>> > >>> >>> \n
>> >>>> > >>> >>> > \n
>> >>>> > >>> >>> > >>>>>>>> \n
>> >>>> > >>> >>> > >>>>>>>>> \n),
>> >>>> > >>> >>> > >>>>>>>>>> and replace it with two <br>.
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern and it is
>> >>>> working
>> >>>> > >>> when I
>> >>>> > >>> >>> test
>> >>>> > >>> >>> > it
>> >>>> > >>> >>> > >>>>>>>> in
>> >>>> > >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I
>> put
>> >>>> it
>> >>>> > >>> inside
>> >>>> > >>> >>> the
>> >>>> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain
>> name="removeCode">
>> >>>> > >>> >>> > >>>>>>>>>> <processor
>> >>>> class="solr.RegexReplaceProcessorFactory">
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> >>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> >>>> > >>> >>> > >>>>>>>>>>  <str
>> >>>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>> > >>> >>> > >>>>>>>>>> </processor>
>> >>>> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> To explain further about my regex pattern,
>> \s* is
>> >>>> > >>> >>> instructing
>> >>>> > >>> >>> > the
>> >>>> > >>> >>> > >>>>>>>> regex
>> >>>> > >>> >>> > >>>>>>>>> to
>> >>>> > >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
>> >>>> > >>> instructing
>> >>>> > >>> >>> the
>> >>>> > >>> >>> > >>>>>>>> regex to
>> >>>> > >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern
>> (\n).
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and
>> how
>> >>>> should
>> >>>> > >>> I do
>> >>>> > >>> >>> it?
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>> Regards,
>> >>>> > >>> >>> > >>>>>>>>>> Edwin
>> >>>> > >>> >>> > >>>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>>
>> >>>> > >>> >>> > >>>>>>>>
>> >>>> > >>> >>> > >>>>>>>
>> >>>> > >>> >>> > >>
>> >>>> > >>> >>> >
>> >>>> > >>> >>>
>> >>>> > >>> >>
>> >>>> > >>>
>> >>>> > >>
>> >>>> >
>> >>>>
>> >>>
>>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Thanks for your reply.

So far we have not found any cases of punctuation being removed.

Our aim is to collapse each run of newlines (\n) into 2 <br>, and these runs
are not likely to have any punctuation in between.

Do you know if this pattern, <str name="pattern">(\n\W*){2,}</str>, which we
are using, is OK?
Or would the other pattern, <str name="pattern">[ \t\x0b\f]*\r?\n</str>, be
better?
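
One way to compare the candidate patterns outside Solr is a small Java
program, since Solr relies on java.util.regex. The sketch below is only an
illustration under assumptions: the class and method names are hypothetical,
the sample strings are made up, and the stray character is assumed to be a
non-breaking space (U+00A0), which this thread never actually confirmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; it is not part of Solr.
final class NewlineCollapseDemo {

    // Roughly mimics one regex replace step with literalReplacement=true:
    // every match of `regex` in `input` becomes the literal text <br><br>.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // ASSUMPTION: the stray character is a non-breaking space (U+00A0).
        String withNbsp = "Psalm 89:17\n\u00A0\n3 Choa Chu Kang";
        // Made-up sample with punctuation right after the blank line.
        String withPunctuation = "Dear Sir,\n\n- I am terminating";

        // By default \s does not match U+00A0, so the two newlines
        // are not seen as one run.
        System.out.println(collapse("(\\n\\s*){2,}", withNbsp));

        // \W matches any non-word character, including U+00A0 ...
        System.out.println(collapse("(\\n\\W*){2,}", withNbsp));
        // ... but it also consumes punctuation that follows the newlines.
        System.out.println(collapse("(\\n\\W*){2,}", withPunctuation));

        // Naming the extra whitespace character explicitly keeps punctuation.
        System.out.println(collapse("(\\n[\\s\\u00A0]*){2,}", withNbsp));
        System.out.println(collapse("(\\n[\\s\\u00A0]*){2,}", withPunctuation));
    }
}
```

Under that assumption, the \s pattern leaves the first sample unchanged, the
\W pattern collapses it but also drops the "- " from the second sample, and
the explicit character class collapses both while keeping the punctuation.
If the real data contains a different character, the class would need to be
adjusted accordingly.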

Regards,
Edwin

On Wed, 13 Mar 2019 at 20:08, <pa...@ub.unibe.ch> wrote:

> Hi Edwin,
> With \W you will also replace non-word characters such as punctuation. If
> that's OK fine. Otherwise you need to identify the white space characters
> that are causing the problem.
> ________________________________
> From: Zheng Lin Edwin Yeo <ed...@gmail.com>
> Sent: Wednesday, 13 March 2019 03:25:39
> To: solr-user@lucene.apache.org
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
> Hi,
>
> We have managed to resolve the issue by changing the \s to \W. The reason
> could be that some of the gaps between the newlines contain whitespace-like
> characters that \s does not match, rather than just plain spaces. \W matches
> any non-word character, so it removes those characters as well.
>
> We have used this config, and it works.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\n\W*){2,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\n\W*){1,}</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> Regards,
> Edwin
>
> On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Has anyone else faced the same issue before?
> > So far, none of the regex patterns that we have tried in this thread has
> > resolved the issue.
> >
> > Regards,
> > Edwin
> >
> > On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> >> Hi Paul,
> >>
> >> Sorry, I realized there is an extra ']' in the pattern provided, which
> is
> >> why there are so many <br> in the output.
> >>
> >> The output is exactly the same as previously (previous index result) if
> >> we remove the extra ']', as shown in the configuration below.
> >>
> >>  <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>    <str name="replacement">&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >>  </processor>
> >>  <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
> >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >>  </processor>
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >>
> >> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <ed...@gmail.com>
> >> wrote:
> >>
> >>> Hi Paul,
> >>>
> >>> Thanks for the reply.
> >>>
> >>> For the 2nd pattern, if we put this pattern <str
> >>> name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, which is like the
> >>> configurations below:
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>    <str name="fieldName">content</str>
> >>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>>    <str name="replacement">&lt;br&gt;</str>
> >>>    <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>    <str name="fieldName">content</str>
> >>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
> >>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>    <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> It is not able to reduce the runs of 3 or more <br> to 2 <br>.
> >>>
> >>> We will end up with many <br> in the output, like the example below:
> >>>
> >>>  http://www.concorded.com/<br><br>
> <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
> On Tue, Dec 18, 2018
> >>>
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:
> >>>
> >>>> Hi Edwin
> >>>>
> >>>>
> >>>>
> >>>> I can’t understand why the pattern is not working and where the spaces
> >>>> between the <br> are coming from. It should be possible to allow for
> spaces
> >>>> between the <br> in the second match pattern however i.e. 2nd pattern
> >>>>
> >>>>
> >>>>
> >>>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
> >>>>
> >>>>
> >>>>
> >>>> /Paul
> >>>>
> >>>>
> >>>>
> >>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>> Windows 10
> >>>>
> >>>>
> >>>>
> >>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>> Sent: Wednesday, 6 March 2019 16:28
> >>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>
> >>>>
> >>>>
> >>>> Hi Paul,
> >>>>
> >>>> I have tried with the first match pattern to be <str name="pattern">[
> >>>> \t\x0b\f]*\r?\n</str>, like the configuration below:
> >>>>
> >>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>    <str name="fieldName">content</str>
> >>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>>>    <str name="replacement">&lt;br&gt;</str>
> >>>>    <bool name="literalReplacement">true</bool>
> >>>> </processor>
> >>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>    <str name="fieldName">content</str>
> >>>>    <str name="pattern">(&lt;br&gt;){3,}</str>
> >>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>    <bool name="literalReplacement">true</bool>
> >>>> </processor>
> >>>>
> >>>> However, the result is still the same as before (previous index
> >>>> results),
> >>>> with the 4 <br>.
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>>
> >>>> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
> >>>>
> >>>> > Hi Edwin
> >>>> >
> >>>> >
> >>>> >
> >>>> > You are correct  re the 2nd pattern – my bad. Looking at the 4 <br>,
> >>>> it’s
> >>>> > actually the sequence «<br><br>  <br><br>»? So perhaps the first
> match
> >>>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >>>> >
> >>>> >
> >>>> >
> >>>> > i.e. [space tab vertical-tab formfeed]
> >>>> >
> >>>> >
> >>>> >
> >>>> > Regards,
> >>>> >
> >>>> > Paul
> >>>> >
> >>>> >
> >>>> >
> >>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>> > Windows 10
> >>>> >
> >>>> >
> >>>> >
> >>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>> > Sent: Wednesday, 6 March 2019 07:44
> >>>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> >
> >>>> >
> >>>> >
> >>>> > Hi Paul,
> >>>> >
> >>>> > I have modified the second pattern to be (&lt;br&gt;){3,}, instead
> of
> >>>> > (&lt;br&gt;&lt;br&gt;){3,}. This pattern of
> >>>> (&lt;br&gt;&lt;br&gt;){3,}
> >>>> > will actually look for 6 or more <br> instead of 3 <br>,  as we have
> >>>> put
> >>>> > the <br> two times in the pattern, which is the reason that there
> are
> >>>> more
> >>>> > <br> in the result, as cases where there are less than 6 <br> are
> not
> >>>> being
> >>>> > replaced, so we ended up having up to 5 <br> in the index.
> >>>> >
> >>>> > Modified configuration:
> >>>> >  <processor class="solr.RegexReplaceProcessorFactory">
> >>>> >    <str name="fieldName">content</str>
> >>>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
> >>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> >    <bool name="literalReplacement">true</bool>
> >>>> >  </processor>
> >>>> >
> >>>> > This will bring us back to the result of the previous index content,
> >>>> > meaning the issue of having the 4 <br> is still there.
> >>>> >
> >>>> > Regards,
> >>>> > Edwin
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <
> >>>> edwinyeozl@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > > Hi Paul,
> >>>> > >
> >>>> > > Further to my previous email, in which there was an extra "}" in the
> >>>> > > configuration, I have changed to the configuration below, based on your
> >>>> > > suggestion.
> >>>> > >
> >>>> > > <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >    <str name="fieldName">content</str>
> >>>> > >    <str name="pattern">[ \t]*\r?\n</str>
> >>>> > >    <str name="replacement">&lt;br&gt;</str>
> >>>> > >    <bool name="literalReplacement">true</bool>
> >>>> > > </processor>
> >>>> > > <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >    <str name="fieldName">content</str>
> >>>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >    <bool name="literalReplacement">true</bool>
> >>>> > > </processor>
> >>>> > >
> >>>> > > However, the result that I get still has more than 2 <br>. In fact, the
> >>>> > > result became worse, as you can see from the comparison below.
> >>>> > >
> >>>> > > Example 1: The sentence for which the regex pattern used to work correctly.
> >>>> > > With the latest pattern, it has changed from 2 <br> to 5 <br>, which is
> >>>> > > wrong.
> >>>> > > *Original content in EML file:*
> >>>> > > Dear Sir,
> >>>> > >
> >>>> > >
> >>>> > > I am terminating
> >>>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> >>>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am
> >>>> > terminating
> >>>> > >
> >>>> > > Example 2: The sentence that the above regex pattern is partially
> >>>> working
> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>> > > *Original content in EML file:*
> >>>> > >
> >>>> > > *exalted*
> >>>> > >
> >>>> > > *Psalm 89:17*
> >>>> > >
> >>>> > >
> >>>> > > 3 Choa Chu Kang Avenue 4
> >>>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n
> 3
> >>>> Choa
> >>>> > > Chu Kang Avenue 4, Singapore
> >>>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> >>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
> >>>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>
> >>>> <br><br>  3
> >>>> > > Choa Chu Kang Avenue 4, Singapore
> >>>> > >
> >>>> > > Example 3: The sentence that the above regex pattern is partially
> >>>> working
> >>>> > > (as you can see, instead of 2 <br>, there are 4 <br>). For the
> >>>> latest
> >>>> > code,
> >>>> > > there are now 5 <br>
> >>>> > > *Original content in EML file:*
> >>>> > >
> >>>> > > http://www.concorded.com/
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM
> >>>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n
> >>>> \n\n \n\n
> >>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> >>>> 2018 at
> >>>> > > 10:07 AM
> >>>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
> >>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>> > > *Current Index content:* http://www.concorded.com/<br><br>
> >>>> <br><br><br>
> >>>> > > On Tue, Dec 18, 2018 at 10:07 AM
> >>>> > >
> >>>> > >
> >>>> > > Regards,
> >>>> > > Edwin
> >>>> > >
> >>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <
> >>>> edwinyeozl@gmail.com>
> >>>> > > wrote:
> >>>> > >
> >>>> > >> Hi Paul,
> >>>> > >>
> >>>> > >> Thank you for the reply.
> >>>> > >>
> >>>> > >> I have tried to add the following configuration according to your
> >>>> > >> suggestion:
> >>>> > >>
> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >>    <str name="fieldName">content</str>
> >>>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
> >>>> > >>    <str name="replacement">&lt;br&gt;</str>
> >>>> > >>    <bool name="literalReplacement">true</bool>
> >>>> > >> </processor>
> >>>> > >>
> >>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >>    <str name="fieldName">content</str>
> >>>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>    <bool name="literalReplacement">true</bool>
> >>>> > >> </processor>
> >>>> > >>
> >>>> > >> However, none of the \n is being removed this time round.
> >>>> > >> Is the order and/or the pattern correct?
> >>>> > >>
> >>>> > >> Regards,
> >>>> > >> Edwin
> >>>> > >>
> >>>> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
> >>>> > >>
> >>>> > >>> Hi Edwin
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> Try for the first pattern/replacement
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> <str name="pattern">[ \t]*\r?\n</str>
> >>>> > >>>
> >>>> > >>> <str name="replacement">&lt;br&gt;</str>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> Now all line endings and preceding whitespace characters should
> be
> >>>> > >>> changed to ‘<br>’.
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> The second pattern replacement should replace 3 or more ‘<br>’
> >>>> > sequences
> >>>> > >>> to 2 ‘<br>’ sequences:
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>>> > >>>
> >>>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> Hope this approach works. Sorry for not replying earlier and
> best
> >>>> > >>> regards,
> >>>> > >>>
> >>>> > >>> Paul
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>> > >>> Sent: Tuesday, 5 March 2019 03:35
> >>>> > >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> > >>>
> >>>> > >>>
> >>>> > >>>
> >>>> > >>> Hi,
> >>>> > >>>
> >>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as
> >>>> well.
> >>>> > >>>
> >>>> > >>> Regards,
> >>>> > >>> Edwin
> >>>> > >>>
> >>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <
> >>>> > edwinyeozl@gmail.com>
> >>>> > >>> wrote:
> >>>> > >>>
> >>>> > >>> > Hi,
> >>>> > >>> >
> >>>> > >>> > Does anyone else have other suggestions, or has anyone faced the same
> >>>> > >>> > problem?
> >>>> > >>> >
> >>>> > >>> > Regards,
> >>>> > >>> > Edwin
> >>>> > >>> >
> >>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <
> >>>> > >>> edwinyeozl@gmail.com>
> >>>> > >>> > wrote:
> >>>> > >>> >
> >>>> > >>> >> Hi Paul,
> >>>> > >>> >>
> >>>> > >>> >> If I tried to execute the second step first, then I will only
> >>>> get a
> >>>> > >>> >> single <br> for those with 2 <br>.
> >>>> > >>> >> For those that we originally get 4 <br>, there will be 2 <br>
> >>>> with a
> >>>> > >>> >> space in between.
> >>>> > >>> >>
> >>>> > >>> >> This is just changing the 2 <br> to be a single <br>, since
> the
> >>>> > second
> >>>> > >>> >> step is to replace with a single <br>.
> >>>> > >>> >> But it has not solved the underlying problem yet.
> >>>> > >>> >>
> >>>> > >>> >> Regards,
> >>>> > >>> >> Edwin
> >>>> > >>> >>
> >>>> > >>> >>
> >>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
> >>>> > >>> >>
> >>>> > >>> >>> If the second step is executed first, then you will get the
> >>>> > unwanted
> >>>> > >>> 4
> >>>> > >>> >>> <br>
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
> >>>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>>
> >>>> > >>> >>> Hi Jörn ,
> >>>> > >>> >>>
> >>>> > >>> >>> Do you mean the regex is not correct?
> >>>> > >>> >>>
> >>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps,
> >>>> like
> >>>> > >>> the one
> >>>> > >>> >>> shown below. The output that we get is still the same.
> >>>> > >>> >>>
> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>>      <str name="fieldName">content</str>
> >>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>> >>>      <bool name="literalReplacement">true</bool>
> >>>> > >>> >>> </processor>
> >>>> > >>> >>>
> >>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>>      <str name="fieldName">content</str>
> >>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
> >>>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
> >>>> > >>> >>>      <bool name="literalReplacement">true</bool>
> >>>> > >>> >>> </processor>
> >>>> > >>> >>>
> >>>> > >>> >>> Regards,
> >>>> > >>> >>> Edwin
> >>>> > >>> >>>
> >>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <
> >>>> jornfranke@gmail.com>
> >>>> > >>> wrote:
> >>>> > >>> >>>
> >>>> > >>> >>> > Then you need two regexprocessfactory steps
> >>>> > >>> >>> >
> >>>> > >>> >>> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo <
> >>>> > >>> >>> edwinyeozl@gmail.com
> >>>> > >>> >>> > >:
> >>>> > >>> >>> > >
> >>>> > >>> >>> > > Hi,
> >>>> > >>> >>> > >
> >>>> > >>> >>> > > Thanks for the reply.
> >>>> > >>> >>> > >
> >>>> > >>> >>> > > Do you know of any regex online tool that works
> correctly
> >>>> for
> >>>> > >>> Java
> >>>> > >>> >>> regex?
> >>>> > >>> >>> > > I tried to find some, but they are not working properly.
> >>>> > >>> >>> > >
> >>>> > >>> >>> > > Yes, our plan is to replace more than one \n with
> >>>> <br><br>, and
> >>>> > >>> >>> single \n
> >>>> > >>> >>> > > with single <br>.
> >>>> > >>> >>> > >
> >>>> > >>> >>> > > Regards,
> >>>> > >>> >>> > > Edwin
> >>>> > >>> >>> > >
> >>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <
> >>>> > jornfranke@gmail.com
> >>>> > >>> >
> >>>> > >>> >>> wrote:
> >>>> > >>> >>> > >>
> >>>> > >>> >>> > >> Solr uses Java regex matching, so i doubt there is a
> bug
> >>>> - it
> >>>> > >>> would
> >>>> > >>> >>> then
> >>>> > >>> >>> > >> be in the JDK. Try out in a regex online Tool that
> >>>> supports
> >>>> > Java
> >>>> > >>> >>> regex
> >>>> > >>> >>> > for
> >>>> > >>> >>> > >> your solution.
> >>>> > >>> >>> > >>
> >>>> > >>> >>> > >> I believe you want to have 2 regex process factories:
> >>>> > >>> >>> > >> One that deals with single \n and one that deals with
> >>>> more
> >>>> > than
> >>>> > >>> one
> >>>> > >>> >>> \n
> >>>> > >>> >>> > >>
> >>>> > >>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <
> >>>> > >>> >>> > edwinyeozl@gmail.com
> >>>> > >>> >>> > >>> :
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> Hi,
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> We have tried with the following pattern ([
> >>>> \t]*\r?\n){2,}
> >>>> > and
> >>>> > >>> >>> > >>> configuration:
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>> > >>>  <str name="fieldName">content</str>
> >>>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
> >>>> > >>> >>> > >>> </processor>
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> However, the issue is still occurring.
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> Is anyone else able to help?
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> Regards,
> >>>> > >>> >>> > >>> Edwin
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> >>>> > >>> >>> > edwinyeozl@gmail.com>
> >>>> > >>> >>> > >>> wrote:
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>>> Hi,
> >>>> > >>> >>> > >>>>
> >>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0
> as
> >>>> > well.
> >>>> > >>> >>> > >>>>
> >>>> > >>> >>> > >>>> Regards,
> >>>> > >>> >>> > >>>> Edwin
> >>>> > >>> >>> > >>>>
> >>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> >>>> > >>> >>> > edwinyeozl@gmail.com
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>>> wrote:
> >>>> > >>> >>> > >>>>
> >>>> > >>> >>> > >>>>> Hi,
> >>>> > >>> >>> > >>>>>
> >>>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
> >>>> > >>> >>> > >>>>>
> >>>> > >>> >>> > >>>>> Regards,
> >>>> > >>> >>> > >>>>> Edwin
> >>>> > >>> >>> > >>>>>
> >>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> >>>> > >>> >>> > edwinyeozl@gmail.com
> >>>> > >>> >>> > >>>
> >>>> > >>> >>> > >>>>> wrote:
> >>>> > >>> >>> > >>>>>
> >>>> > >>> >>> > >>>>>> Hi Paul,
> >>>> > >>> >>> > >>>>>>
> >>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>> > >>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all the
> >>>> > >>> >>> > >>>>>> examples (i.e. all of them have only <br><br>, and not more than that,
> >>>> > >>> >>> > >>>>>> unlike what we are getting in Solr in our earlier examples).
> >>>> > >>> >>> > >>>>>>
> >>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
> >>>> > >>> >>> > >>>>>>
> >>>> > >>> >>> > >>>>>> Regards,
> >>>> > >>> >>> > >>>>>> Edwin
> >>>> > >>> >>> > >>>>>>
> >>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> >>>> > >>> >>> > >> edwinyeozl@gmail.com>
> >>>> > >>> >>> > >>>>>> wrote:
> >>>> > >>> >>> > >>>>>>
> >>>> > >>> >>> > >>>>>>> Hi Paul,
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
> >>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following
> >>>> > >>> >>> > >>>>>>> configuration:
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> <processor
> >>>> class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
> >>>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> >>>> > >>> >>> > >>>>>>>  <str
> name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>> >>> > >>>>>>> </processor>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> However, we are also getting the exact same
> results
> >>>> as
> >>>> > the
> >>>> > >>> >>> earlier
> >>>> > >>> >>> > >>>>>>> Example 1, 2 and 3.
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
> >>>> > >>> >>> > >>>>>>> characters than \n: we have found no non-printing characters. It is
> >>>> > >>> >>> > >>>>>>> just a newline with a space. You can refer to the original content in
> >>>> > >>> >>> > >>>>>>> the same examples below.
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> working
> >>>> > >>> >>> > >>>>>>> correctly
> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
> >>>> > >>> >>> > >>>>>>> Dear Sir,
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> I am terminating
> >>>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I
> am
> >>>> > >>> terminating
> >>>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> >>>> terminating
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> partially
> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
> >>>> are 4
> >>>> > >>> <br>)
> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> *exalted*
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> *Psalm 89:17*
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
> >>>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
> >>>>  \n\n
> >>>> > >>> >>>  \n\n  3
> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
> >>>>  <br><br>
> >>>> > >>> >>> <br><br>3
> >>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> partially
> >>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
> >>>> are 4
> >>>> > >>> <br>)
> >>>> > >>> >>> > >>>>>>> *Original content in EML file:*
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>> > >>> >>> > >>>>>>> *Original content:*
> >>>> http://www.concordpri.moe.edu.sg/
> >>>> > >>>  \n\n
> >>>> > >>> >>> >  \n\n
> >>>> > >>> >>> > >> \n
> >>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n
> \n\n\n
> >>>> > >>> \n\n\n  On
> >>>> > >>> >>> Tue,
> >>>> > >>> >>> > >> Dec 18,
> >>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
> >>>> > >>> >>> > >>>>>>> *Index content: *
> http://www.concordpri.moe.edu.sg/
> >>>> > >>>  <br><br>
> >>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you
> >>>> may
> >>>> > >>> have.
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> Thank you.
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>> Regards,
> >>>> > >>> >>> > >>>>>>> Edwin
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <
> >>>> paul.dodd@ub.unibe.ch>
> >>>> > >>> wrote:
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Hi Edwin
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
> >>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>> > >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
> >>>> > >>> >>> > >>>>>>>> than \n?
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>>> > >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>> > >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> > >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Hi Paul,
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as
> >>>> follow:
> >>>> > >>> >>> > >>>>>>>> <processor
> >>>> class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
> >>>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> >>>> > >>> >>> > >>>>>>>>  <str
> name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>> >>> > >>>>>>>> </processor>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of
> >>>> Example
> >>>> > 1,2
> >>>> > >>> and
> >>>> > >>> >>> 3
> >>>> > >>> >>> > >> below.
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> working
> >>>> > >>> >>> > >>>>>>>> correctly
> >>>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I
> am
> >>>> > >>> >>> terminating
> >>>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> >>>> terminating
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> partially
> >>>> > >>> >>> > >>>>>>>> working
> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
> >>>> <br>)
> >>>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm
> 89:17
> >>>> >  \n\n
> >>>> > >>> >>>  \n\n
> >>>> > >>> >>> > 3
> >>>> > >>> >>> > >>>>>>>> Choa
> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> >>>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
> >>>>  <br><br>
> >>>> > >>> >>> > <br><br>3
> >>>> > >>> >>> > >>>>>>>> Choa
> >>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> partially
> >>>> > >>> >>> > >>>>>>>> working
> >>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
> >>>> <br>)
> >>>> > >>> >>> > >>>>>>>> *Original content:*
> >>>> http://www.concordpri.moe.edu.sg/
> >>>> > >>>  \n\n
> >>>> > >>> >>> >  \n\n
> >>>> > >>> >>> > >>>>>>>> \n \n\n
> >>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
> >>>> \n\n\n
> >>>> > On
> >>>> > >>> >>> Tue, Dec
> >>>> > >>> >>> > >> 18,
> >>>> > >>> >>> > >>>>>>>> 2018
> >>>> > >>> >>> > >>>>>>>> at 10:07 AM
> >>>> > >>> >>> > >>>>>>>> *Index content: *
> http://www.concordpri.moe.edu.sg/
> >>>> > >>>  <br><br>
> >>>> > >>> >>> > >>>>>>>> <br><br>On
> >>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Any further suggestion?
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Thank you.
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>> Regards,
> >>>> > >>> >>> > >>>>>>>> Edwin
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <
> >>>> paul.dodd@ub.unibe.ch>
> >>>> > >>> wrote:
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and
> >>>> then
> >>>> > >>> failing
> >>>> > >>> >>> on
> >>>> > >>> >>> > the
> >>>> > >>> >>> > >>>>>>>> {2,}
> >>>> > >>> >>> > >>>>>>>>> part you could try
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>>> > >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>> > >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> > >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Hi Paul,
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Thanks for your reply.
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> When I use this pattern:
> >>>> > >>> >>> > >>>>>>>>> <processor
> >>>> class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
> >>>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> >>>> > >>> >>> > >>>>>>>>>  <str
> >>>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>> >>> > >>>>>>>>> </processor>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> It works for some sentences within the same content and not for
> >>>> > >>> >>> > >>>>>>>>> others. Please see below for one that works and others that only
> >>>> > >>> >>> > >>>>>>>>> partially work:
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> working
> >>>> > >>> >>> > >>>>>>>> correctly
> >>>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I
> >>>> am
> >>>> > >>> >>> terminating
> >>>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> >>>> > terminating
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> partially
> >>>> > >>> >>> > >>>>>>>> working
> >>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
> >>>> <br>)
> >>>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm
> 89:17
> >>>> >  \n\n
> >>>> > >>> >>> >  \n\n  3
> >>>> > >>> >>> > >>>>>>>> Choa
> >>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> >>>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
> >>>> >  <br><br>
> >>>> > >>> >>> > <br><br>3
> >>>> > >>> >>> > >>>>>>>> Choa
> >>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex
> >>>> pattern is
> >>>> > >>> >>> partially
> >>>> > >>> >>> > >>>>>>>> working
> >>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
> >>>> <br>)
> >>>> > >>> >>> > >>>>>>>>> *Original content:*
> >>>> http://www.concordpri.moe.edu.sg/
> >>>> > >>>  \n\n
> >>>> > >>> >>> > >> \n\n
> >>>> > >>> >>> > >>>>>>>> \n
> >>>> > >>> >>> > >>>>>>>>> \n\n
> >>>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
> >>>> \n\n\n
> >>>> > On
> >>>> > >>> >>> Tue,
> >>>> > >>> >>> > Dec
> >>>> > >>> >>> > >>>>>>>> 18, 2018
> >>>> > >>> >>> > >>>>>>>>> at 10:07 AM
> >>>> > >>> >>> > >>>>>>>>> *Index content: *
> >>>> http://www.concordpri.moe.edu.sg/
> >>>> > >>> >>>  <br><br>
> >>>> > >>> >>> > >>>>>>>> <br><br>On
> >>>> > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Thank you.
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>> Regards,
> >>>> > >>> >>> > >>>>>>>>> Edwin
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <
> >>>> paul.dodd@ub.unibe.ch>
> >>>> > >>> wrote:
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not
> >>>> > >>> working. I
> >>>> > >>> >>> > assume
> >>>> > >>> >>> > >>>>>>>> nothing
> >>>> > >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> ??
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
> >>>> > >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>> > >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>> > >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> Hi,
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> I am trying to use the
> >>>> RegexReplaceProcessorFactory to
> >>>> > >>> >>> remove
> >>>> > >>> >>> > more
> >>>> > >>> >>> > >>>>>>>> than
> >>>> > >>> >>> > >>>>>>>>> two
> >>>> > >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg:
> >>>> \n\n,
> >>>> > \n
> >>>> > >>> \n,
> >>>> > >>> >>> \n
> >>>> > >>> >>> > \n
> >>>> > >>> >>> > >>>>>>>> \n
> >>>> > >>> >>> > >>>>>>>>> \n),
> >>>> > >>> >>> > >>>>>>>>>> and replace it with two <br>.
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern and it is
> >>>> working
> >>>> > >>> when I
> >>>> > >>> >>> test
> >>>> > >>> >>> > it
> >>>> > >>> >>> > >>>>>>>> in
> >>>> > >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put
> >>>> it
> >>>> > >>> inside
> >>>> > >>> >>> the
> >>>> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>> > >>> >>> > >>>>>>>>>> <processor
> >>>> class="solr.RegexReplaceProcessorFactory">
> >>>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
> >>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> >>>> > >>> >>> > >>>>>>>>>>  <str
> >>>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>> > >>> >>> > >>>>>>>>>> </processor>
> >>>> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s*
> is
> >>>> > >>> >>> instructing
> >>>> > >>> >>> > the
> >>>> > >>> >>> > >>>>>>>> regex
> >>>> > >>> >>> > >>>>>>>>> to
> >>>> > >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
> >>>> > >>> instructing
> >>>> > >>> >>> the
> >>>> > >>> >>> > >>>>>>>> regex to
> >>>> > >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern
> (\n).
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how
> >>>> should
> >>>> > >>> I do
> >>>> > >>> >>> it?
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>> Regards,
> >>>> > >>> >>> > >>>>>>>>>> Edwin
> >>>> > >>> >>> > >>>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>>
> >>>> > >>> >>> > >>>>>>>>
> >>>> > >>> >>> > >>>>>>>
> >>>> > >>> >>> > >>
> >>>> > >>> >>> >
> >>>> > >>> >>>
> >>>> > >>> >>
> >>>> > >>>
> >>>> > >>
> >>>> >
> >>>>
> >>>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
Hi Edwin,
With \W you will also replace non-word characters such as punctuation. If that's OK, fine. Otherwise you need to identify the whitespace characters that are causing the problem.
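
To illustrate the point about punctuation, here is a small Java sketch (Solr relies on java.util.regex). It applies the pattern (\n\W*){2,} from the configuration quoted below to a made-up sample string and compares it with a narrower alternative that lists the whitespace explicitly. The class name, the sample text and the choice of a no-break space (U+00A0) as the extra character are assumptions for illustration only; the thread never identifies the actual character.

```java
import java.util.regex.Pattern;

final class NonWordPitfall {
    // Pattern taken from the configuration in this thread.
    static final Pattern NON_WORD = Pattern.compile("(\\n\\W*){2,}");

    // Hypothetical narrower alternative: ASCII whitespace plus the no-break space.
    static final Pattern SPACES_ONLY = Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    // Replace every match of the given pattern with two <br> tags.
    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String text = "Total:\n\n(see below)";
        // \W* also consumes the opening parenthesis after the line breaks.
        System.out.println(collapse(NON_WORD, text));    // Total:<br><br>see below)
        // The explicit whitespace class leaves the punctuation alone.
        System.out.println(collapse(SPACES_ONLY, text)); // Total:<br><br>(see below)
    }
}
```

If the content can start a paragraph with a bracket, quote or bullet character, the narrower class is probably the safer choice once the offending character is known.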
________________________________
From: Zheng Lin Edwin Yeo <ed...@gmail.com>
Sent: Wednesday, 13 March 2019 03:25:39
To: solr-user@lucene.apache.org
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Hi,

We have managed to resolve the issue by changing the \s to \W. The reason
could be that some of the spaces are other white-space characters rather
than plain spaces. Using \s did not match those characters, but using \W
matches them as well.
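
One possible explanation, sketched in Java below, is that the content contains a no-break space (U+00A0). This is only an assumption, since the thread does not say which character was involved. By default, Java's \s covers only the ASCII whitespace characters, so it does not match U+00A0, while \W (anything that is not a letter, digit or underscore) does. The embedded flag (?U) switches \s to the Unicode definition of whitespace. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

final class WhitespaceClasses {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // no-break space, assumed to be the offending character

        System.out.println(matches("\\s", nbsp));     // false: default \s is ASCII-only
        System.out.println(matches("\\W", nbsp));     // true: not a word character
        System.out.println(matches("(?U)\\s", nbsp)); // true: Unicode-aware \s
    }
}
```

If this is indeed the cause, a pattern such as (?U)(\n\s*){2,} might be an alternative to \W that keeps punctuation intact, but that has not been tested in this thread.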

We have used this config, and it works.

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\W*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\W*){1,}</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
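
The effect of the two processors above can be approximated outside Solr with plain Java. The sketch assumes that each processor behaves like Matcher.replaceAll and that literalReplacement=true corresponds to Matcher.quoteReplacement; the class name and the sample text are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BreakChainSketch {
    // One regex-replace step with a literal (non-interpreted) replacement.
    static String apply(String pattern, String literalReplacement, String text) {
        return Pattern.compile(pattern)
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement(literalReplacement));
    }

    // Runs the two steps in the same order as the configuration above.
    static String run(String content) {
        // Step 1: two or more line breaks (with non-word characters between) become <br><br>.
        String afterStep1 = apply("(\\n\\W*){2,}", "<br><br>", content);
        // Step 2: any remaining single line break becomes <br>.
        return apply("(\\n\\W*){1,}", "<br>", afterStep1);
    }

    public static void main(String[] args) {
        String content = "Dear Sir,  \n\n \n \n\n I am terminating\nthe contract";
        System.out.println(run(content));
        // Dear Sir,  <br><br>I am terminating<br>the contract
    }
}
```

The order matters: if the single-break step ran first, no runs of \n would be left for the other step to collapse.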

Regards,
Edwin

On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> Has anyone else faced the same issue before?
> So far all the regex patterns that we tried in this thread are not able to
> resolve the issue.
>
> Regards,
> Edwin
>
> On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Sorry, I realized there is an extra ']' in the pattern provided, which is
>> why there are so many <br> in the output.
>>
>> The output is exactly the same as previously (previous index result) if
>> we remove the extra ']', as shown in the configuration below.
>>
>>  <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>>  </processor>
>>  <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>>  </processor>
>>
>> Regards,
>> Edwin
>>
>>
>>
>> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> Thanks for the reply.
>>>
>>> For the 2nd pattern, if we put this pattern <str
>>> name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, which is like the
>>> configurations below:
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>    <str name="replacement">&lt;br&gt;</str>
>>>    <bool name="literalReplacement">true</bool>
>>> </processor>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>    <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> It is not able to reduce the runs of 3 or more <br> to 2 <br>.
>>>
>>> We will end up with many <br> in the output, like the example below:
>>>
>>>  http://www.concorded.com/<br><br>  <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> On Tue, Dec 18, 2018
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>>
>>>
>>> On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:
>>>
>>>> Hi Edwin
>>>>
>>>>
>>>>
>>>> I can’t understand why the pattern is not working and where the spaces
>>>> between the <br> are coming from. It should be possible to allow for spaces
>>>> between the <br> in the second match pattern however i.e. 2nd pattern
>>>>
>>>>
>>>>
>>>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>>>
>>>>
>>>>
>>>> /Paul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> Sent: Wednesday, 6 March 2019 16:28
>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>
>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> I have tried setting the first match pattern to
>>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:
>>>>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>>    <str name="replacement">&lt;br&gt;</str>
>>>>    <bool name="literalReplacement">true</bool>
>>>> </processor>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">(&lt;br&gt;){3,}</str>
>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>    <bool name="literalReplacement">true</bool>
>>>> </processor>
>>>>
>>>> However, the result is still the same as before (previous index
>>>> results),
>>>> with the 4 <br>.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>>
>>>> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
>>>>
>>>> > Hi Edwin
>>>> >
>>>> >
>>>> >
>>>> > You are correct  re the 2nd pattern – my bad. Looking at the 4 <br>,
>>>> it’s
>>>> > actually the sequence «<br><br>  <br><br>»? So perhaps the first match
>>>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>> >
>>>> >
>>>> >
>>>> > i.e. [space tab vertical-tab formfeed]
>>>> >
>>>> >
>>>> >
>>>> > Regards,
>>>> >
>>>> > Paul
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > Sent: Wednesday, 6 March 2019 07:44
>>>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> >
>>>> >
>>>> >
>>>> > Hi Paul,
>>>> >
>>>> > I have modified the second pattern to be (&lt;br&gt;){3,}, instead of
>>>> > (&lt;br&gt;&lt;br&gt;){3,}. This pattern of
>>>> (&lt;br&gt;&lt;br&gt;){3,}
>>>> > will actually look for 6 or more <br> instead of 3 <br>,  as we have
>>>> put
>>>> > the <br> two times in the pattern, which is the reason that there are
>>>> more
>>>> > <br> in the result, as cases where there are less than 6 <br> are not
>>>> being
>>>> > replaced, so we ended up having up to 5 <br> in the index.
>>>> >
>>>> > Modified configuration:
>>>> >  <processor class="solr.RegexReplaceProcessorFactory">
>>>> >    <str name="fieldName">content</str>
>>>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
>>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> >    <bool name="literalReplacement">true</bool>
>>>> >  </processor>
>>>> >
>>>> > This will bring us back to the result of the previous index content,
>>>> > meaning the issue of having the 4 <br> is still there.
>>>> >
>>>> > Regards,
>>>> > Edwin
>>>> >
>>>> >
>>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <
>>>> edwinyeozl@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > Hi Paul,
>>>> > >
>>>> > > Further to my previous email, in which there was an extra "}" in the
>>>> > > configuration, I have changed to use the below configuration based
>>>> on
>>>> > your
>>>> > > suggestion.
>>>> > >
>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >    <str name="fieldName">content</str>
>>>> > >    <str name="pattern">[ \t]*\r?\n</str>
>>>> > >    <str name="replacement">&lt;br&gt;</str>
>>>> > >    <bool name="literalReplacement">true</bool>
>>>> > > </processor>
>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >    <str name="fieldName">content</str>
>>>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >    <bool name="literalReplacement">true</bool>
>>>> > > </processor>
>>>> > >
>>>> > > However, the result that I get still has more than 2 <br>. In fact,
>>>> the
>>>> > > result become worse, as you can see from the comparison below.
>>>> > >
>>>> > > Example 1: The sentence that the regex pattern used to work
>>>> correctly.
>>>> > But
>>>> > > with the latest pattern, it has now changed from 2 <br> to become 5
>>>> <br>,
>>>> > > which is wrong.
>>>> > > *Original content in EML file:*
>>>> > > Dear Sir,
>>>> > >
>>>> > >
>>>> > > I am terminating
>>>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
>>>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am
>>>> > terminating
>>>> > >
>>>> > > Example 2: The sentence that the above regex pattern is partially
>>>> working
>>>> > > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > > *Original content in EML file:*
>>>> > >
>>>> > > *exalted*
>>>> > >
>>>> > > *Psalm 89:17*
>>>> > >
>>>> > >
>>>> > > 3 Choa Chu Kang Avenue 4
>>>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> Choa
>>>> > > Chu Kang Avenue 4, Singapore
>>>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>
>>>> <br><br>  3
>>>> > > Choa Chu Kang Avenue 4, Singapore
>>>> > >
>>>> > > Example 3: The sentence that the above regex pattern is partially
>>>> working
>>>> > > (as you can see, instead of 2 <br>, there are 4 <br>). For the
>>>> latest
>>>> > code,
>>>> > > there are now 5 <br>
>>>> > > *Original content in EML file:*
>>>> > >
>>>> > > http://www.concorded.com/
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>>>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n
>>>> \n\n \n\n
>>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> 2018 at
>>>> > > 10:07 AM
>>>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
>>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> > > *Current Index content:* http://www.concorded.com/<br><br>
>>>> <br><br><br>
>>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >
>>>> > >
>>>> > > Regards,
>>>> > > Edwin
>>>> > >
>>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <
>>>> edwinyeozl@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hi Paul,
>>>> > >>
>>>> > >> Thank you for the reply.
>>>> > >>
>>>> > >> I have tried to add the following configuration according to your
>>>> > >> suggestion:
>>>> > >>
>>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>    <str name="fieldName">content</str>
>>>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
>>>> > >>    <str name="replacement">&lt;br&gt;</str>
>>>> > >>    <bool name="literalReplacement">true</bool>
>>>> > >> </processor>
>>>> > >>
>>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>    <str name="fieldName">content</str>
>>>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>    <bool name="literalReplacement">true</bool>
>>>> > >> </processor>
>>>> > >>
>>>> > >> However, none of the \n is being removed this time round.
>>>> > >> Is the order and/or the pattern correct?
>>>> > >>
>>>> > >> Regards,
>>>> > >> Edwin
>>>> > >>
>>>> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>>>> > >>
>>>> > >>> Hi Edwin
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Try for the first pattern/replacement
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> <str name="pattern">[ \t]*\r?\n</str>
>>>> > >>>
>>>> > >>> <str name="replacement">&lt;br&gt;</str>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Now all line endings and preceding whitespace characters should be
>>>> > >>> changed to ‘<br>’.
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> The second pattern replacement should replace 3 or more ‘<br>’
>>>> > sequences
>>>> > >>> to 2 ‘<br>’ sequences:
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>> > >>>
>>>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Hope this approach works. Sorry for not replying earlier and best
>>>> > >>> regards,
>>>> > >>>
>>>> > >>> Paul
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > >>> Sent: Tuesday, 5 March 2019 03:35
>>>> > >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Hi,
>>>> > >>>
>>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as
>>>> well.
>>>> > >>>
>>>> > >>> Regards,
>>>> > >>> Edwin
>>>> > >>>
>>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <
>>>> > edwinyeozl@gmail.com>
>>>> > >>> wrote:
>>>> > >>>
>>>> > >>> > Hi,
>>>> > >>> >
>>>> > >>> > Does anyone else have other suggestions, or has anyone faced the same problem?
>>>> > >>> >
>>>> > >>> > Regards,
>>>> > >>> > Edwin
>>>> > >>> >
>>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <
>>>> > >>> edwinyeozl@gmail.com>
>>>> > >>> > wrote:
>>>> > >>> >
>>>> > >>> >> Hi Paul,
>>>> > >>> >>
>>>> > >>> >> If I tried to execute the second step first, then I will only
>>>> get a
>>>> > >>> >> single <br> for those with 2 <br>.
>>>> > >>> >> For those that we originally get 4 <br>, there will be 2 <br>
>>>> with a
>>>> > >>> >> space in between.
>>>> > >>> >>
>>>> > >>> >> This is just changing the 2 <br> to be a single <br>, since the
>>>> > second
>>>> > >>> >> step is to replace with a single <br>.
>>>> > >>> >> But it has not solved the underlying problem yet.
>>>> > >>> >>
>>>> > >>> >> Regards,
>>>> > >>> >> Edwin
>>>> > >>> >>
>>>> > >>> >>
>>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>>> > >>> >>
>>>> > >>> >>> If the second step is executed first, then you will get the
>>>> > unwanted
>>>> > >>> 4
>>>> > >>> >>> <br>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
>>>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>> Hi Jörn ,
>>>> > >>> >>>
>>>> > >>> >>> Do you mean the regex is not correct?
>>>> > >>> >>>
>>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps,
>>>> like
>>>> > >>> the one
>>>> > >>> >>> shown below. The output that we get is still the same.
>>>> > >>> >>>
>>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>>      <str name="fieldName">content</str>
>>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>>>> > >>> >>> </processor>
>>>> > >>> >>>
>>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>>      <str name="fieldName">content</str>
>>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
>>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>>>> > >>> >>> </processor>
>>>> > >>> >>>
>>>> > >>> >>> Regards,
>>>> > >>> >>> Edwin
>>>> > >>> >>>
>>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <
>>>> jornfranke@gmail.com>
>>>> > >>> wrote:
>>>> > >>> >>>
>>>> > >>> >>> > Then you need two regexprocessfactory steps
>>>> > >>> >>> >
>>>> > >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>> > >>> >>> > >
>>>> > >>> >>> > > Hi,
>>>> > >>> >>> > >
>>>> > >>> >>> > > Thanks for the reply.
>>>> > >>> >>> > >
>>>> > >>> >>> > > Do you know of any regex online tool that works correctly
>>>> for
>>>> > >>> Java
>>>> > >>> >>> regex?
>>>> > >>> >>> > > I tried to find some, but they are not working properly.
>>>> > >>> >>> > >
>>>> > >>> >>> > > Yes, our plan is to replace more than one \n with
>>>> <br><br>, and
>>>> > >>> >>> single \n
>>>> > >>> >>> > > with single <br>.
>>>> > >>> >>> > >
>>>> > >>> >>> > > Regards,
>>>> > >>> >>> > > Edwin
>>>> > >>> >>> > >
>>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <
>>>> > jornfranke@gmail.com
>>>> > >>> >
>>>> > >>> >>> wrote:
>>>> > >>> >>> > >>
>>>> > >>> >>> > >> Solr uses Java regex matching, so i doubt there is a bug
>>>> - it
>>>> > >>> would
>>>> > >>> >>> then
>>>> > >>> >>> > >> be in the JDK. Try out in a regex online Tool that
>>>> supports
>>>> > Java
>>>> > >>> >>> regex
>>>> > >>> >>> > for
>>>> > >>> >>> > >> your solution.
>>>> > >>> >>> > >>
>>>> > >>> >>> > >> I believe you want to have 2 regex process factories:
>>>> > >>> >>> > >> One that deals with single \n and one that deals with
>>>> more
>>>> > than
>>>> > >>> one
>>>> > >>> >>> \n
>>>> > >>> >>> > >>
>>>> > >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> Hi,
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> We have tried with the following pattern ([
>>>> \t]*\r?\n){2,}
>>>> > and
>>>> > >>> >>> > >>> configuration:
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
>>>> > >>> >>> > >>> </processor>
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> However, the issue is still occurring.
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> Anyone else is able to help?
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> Regards,
>>>> > >>> >>> > >>> Edwin
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>>>> > >>> >>> > edwinyeozl@gmail.com>
>>>> > >>> >>> > >>> wrote:
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>>> Hi,
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as
>>>> > well.
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>> Regards,
>>>> > >>> >>> > >>>> Edwin
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>>>> > >>> >>> > edwinyeozl@gmail.com
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>>> wrote:
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>>> Hi,
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>> Regards,
>>>> > >>> >>> > >>>>> Edwin
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>>>> > >>> >>> > edwinyeozl@gmail.com
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>>>> wrote:
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>>> > >>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all the
>>>> > >>> >>> > >>>>>> examples (i.e. all of them have only <br><br>, and not more than that,
>>>> > >>> >>> > >>>>>> unlike what we are getting in Solr in our earlier examples).
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> Regards,
>>>> > >>> >>> > >>>>>> Edwin
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>>>> > >>> >>> > >> edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>> wrote:
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following
>>>> > >>> >>> > >>>>>>> regex pattern:
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> <processor
>>>> class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>>> > >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> However, we are also getting the exact same results
>>>> as
>>>> > the
>>>> > >>> >>> earlier
>>>> > >>> >>> > >>>>>>> Example 1, 2 and 3.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> As for your point 2, that perhaps the data has other (non-printing)
>>>> > >>> >>> > >>>>>>> characters than \n, we have found that there are no non-printing
>>>> > >>> >>> > >>>>>>> characters. It is just a line break with a space. You can refer to
>>>> > >>> >>> > >>>>>>> the original content in the same examples below.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex
>>>> pattern is
>>>> > >>> working
>>>> > >>> >>> > >>>>>>> correctly
>>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>>> > >>> >>> > >>>>>>> Dear Sir,
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> I am terminating
>>>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>>> > >>> terminating
>>>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>>> terminating
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>>>> are 4
>>>> > >>> <br>)
>>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> *exalted*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> *Psalm 89:17*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>>>  \n\n
>>>> > >>> >>>  \n\n  3
>>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>>>  <br><br>
>>>> > >>> >>> <br><br>3
>>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>>>> are 4
>>>> > >>> <br>)
>>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>> *Original content:*
>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>>  \n\n
>>>> > >>> >>> >  \n\n
>>>> > >>> >>> > >> \n
>>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>>> > >>> \n\n\n  On
>>>> > >>> >>> Tue,
>>>> > >>> >>> > >> Dec 18,
>>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>> > >>>  <br><br>
>>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you
>>>> may
>>>> > >>> have.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Thank you.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Regards,
>>>> > >>> >>> > >>>>>>> Edwin
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <
>>>> paul.dodd@ub.unibe.ch>
>>>> > >>> wrote:
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Hi Edwin
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>>>> > >>> >>> > >>>>>>>>     i.e. <str name="pattern">(\s*\n){2,}</str>
>>>> > >>> >>> > >>>>>>>> 2.  Perhaps the data contains other (non-printing) characters than \n?
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>> > >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>>>> > >>> >>> > >>>>>>>> <processor
>>>> class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of
>>>> Example
>>>> > 1,2
>>>> > >>> and
>>>> > >>> >>> 3
>>>> > >>> >>> > >> below.
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> working
>>>> > >>> >>> > >>>>>>>> correctly
>>>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>>> > >>> >>> terminating
>>>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>>> terminating
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>>> working
>>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>>> <br>)
>>>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>>> >  \n\n
>>>> > >>> >>>  \n\n
>>>> > >>> >>> > 3
>>>> > >>> >>> > >>>>>>>> Choa
>>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>>>  <br><br>
>>>> > >>> >>> > <br><br>3
>>>> > >>> >>> > >>>>>>>> Choa
>>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>>> working
>>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>>> <br>)
>>>> > >>> >>> > >>>>>>>> *Original content:*
>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>>  \n\n
>>>> > >>> >>> >  \n\n
>>>> > >>> >>> > >>>>>>>> \n \n\n
>>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>>> \n\n\n
>>>> > On
>>>> > >>> >>> Tue, Dec
>>>> > >>> >>> > >> 18,
>>>> > >>> >>> > >>>>>>>> 2018
>>>> > >>> >>> > >>>>>>>> at 10:07 AM
>>>> > >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>> > >>>  <br><br>
>>>> > >>> >>> > >>>>>>>> <br><br>On
>>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Any further suggestion?
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Thank you.
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Regards,
>>>> > >>> >>> > >>>>>>>> Edwin
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <
>>>> paul.dodd@ub.unibe.ch>
>>>> > >>> wrote:
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and
>>>> then
>>>> > >>> failing
>>>> > >>> >>> on
>>>> > >>> >>> > the
>>>> > >>> >>> > >>>>>>>> {2,}
>>>> > >>> >>> > >>>>>>>>> part you could try
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>> > >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Thanks for your reply.
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> When I use this pattern:
>>>> > >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> It works for some sentences within the same content but not
>>>> > >>> >>> > >>>>>>>>> for others. Please see below for one case that works and
>>>> > >>> >>> > >>>>>>>>> others that only partially work:
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Example 1: A sentence for which the above regex pattern works correctly
>>>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Example 2: A sentence for which the above regex pattern only partially
>>>> > >>> >>> > >>>>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>> > >>> >>> > >>>>>>>>> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Example 3: A sentence for which the above regex pattern only partially
>>>> > >>> >>> > >>>>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > >>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> > >>> >>> > >>>>>>>>> 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> > >>> >>> > >>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> We would appreciate your help in finding out what is wrong.
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Thank you.
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Regards,
>>>> > >>> >>> > >>>>>>>>> Edwin
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I
>>>> > >>> >>> > >>>>>>>>>> assume nothing is replaced? Perhaps the pattern should be
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> ??
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > >>> >>> > >>>>>>>>>> Windows 10
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>> > >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Hi,
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove more than
>>>> > >>> >>> > >>>>>>>>>> two \n with any number of spaces between them (Eg: \n\n, \n \n,
>>>> > >>> >>> > >>>>>>>>>> \n \n  \n \n), and replace it with two <br>.
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern and it is working when I test it in
>>>> > >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside the
>>>> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>> > >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>>> > >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is instructing the regex to
>>>> > >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is instructing the regex to
>>>> > >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should I do it?
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Regards,
>>>> > >>> >>> > >>>>>>>>>> Edwin

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

We have managed to resolve the issue by changing \s to \W. The reason could
be that some of the characters between the line breaks are other kinds of
white space rather than plain spaces. By default, \s in Java regex matches
only the ASCII whitespace characters [ \t\n\x0B\f\r], so it misses characters
such as a non-breaking space, whereas \W matches any non-word character and
removes those as well.
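
As a rough illustration of that difference, here is a minimal Java sketch. It
is not taken from our data: it assumes the stray character is a non-breaking
space (U+00A0), which we have not confirmed, and the class name
WhitespaceClassDemo, the helper collapse() and the sample string are made up
for the example.

```java
import java.util.regex.Pattern;

class WhitespaceClassDemo {
    // Hypothetical sample: two line breaks separated by a non-breaking space (U+00A0).
    static final String SAMPLE = "Dear Sir,\n\u00A0\nI am terminating";

    // Replace every match of the given pattern with two <br>, as the processor config does.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Default \s does not match U+00A0, so the group can never repeat twice in a row
        // and the sample is left unchanged.
        System.out.println(collapse("(\\n\\s*){2,}", SAMPLE));
        // \W matches any non-word character, including U+00A0, so the whole run is replaced.
        System.out.println(collapse("(\\n\\W*){2,}", SAMPLE));
    }
}
```

One thing to keep in mind with this approach: \W also matches punctuation, so
punctuation that directly follows a line break is removed together with the
white space.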

We have used this config, and it works.

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\W*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\W*){1,}</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
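
For anyone who wants to try the two steps outside Solr, the sketch below
applies the same two patterns in the same order with plain java.util.regex.
The class name TwoStepBreaks and the method toBreaks() are made up for the
example, and using Matcher.quoteReplacement to stand in for
literalReplacement=true is an assumption about how the processor behaves, not
something confirmed in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepBreaks {
    // Step 1 of the config: runs of two or more line breaks become <br><br>.
    static final Pattern MULTI = Pattern.compile("(\\n\\W*){2,}");
    // Step 2 of the config: any remaining line break becomes a single <br>.
    static final Pattern SINGLE = Pattern.compile("(\\n\\W*){1,}");

    static String toBreaks(String content) {
        // quoteReplacement treats the replacement as a literal string
        // (assumed equivalent of literalReplacement=true).
        String step1 = MULTI.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
        return SINGLE.matcher(step1)
                .replaceAll(Matcher.quoteReplacement("<br>"));
    }

    public static void main(String[] args) {
        System.out.println(toBreaks("exalted  \n \n\n   Psalm 89:17\nline two"));
    }
}
```

The order matters: if the single-break step ran first, every \n would already
be a <br> by the time the multi-break pattern is applied, and it would have
nothing left to match.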

Regards,
Edwin

On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> Has anyone else faced the same issue before?
> So far, none of the regex patterns that we have tried in this thread has
> resolved the issue.
>
> Regards,
> Edwin
>
> On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Sorry, I realized there is an extra ']' in the pattern provided, which is
>> why there are so many <br> in the output.
>>
>> The output is exactly the same as previously (previous index result) if
>> we remove the extra ']', as shown in the configuration below.
>>
>>  <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>>  </processor>
>>  <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>>  </processor>
>>
>> Regards,
>> Edwin
>>
>>
>>
>> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> Thanks for the reply.
>>>
>>> For the 2nd pattern, if we put this pattern <str
>>> name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, which is like the
>>> configurations below:
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>    <str name="replacement">&lt;br&gt;</str>
>>>    <bool name="literalReplacement">true</bool>
>>> </processor>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>    <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> It is not able to reduce the runs of 3 or more <br> to 2 <br>.
>>>
>>> We will end up with many <br> in the output, like the example below:
>>>
>>>  http://www.concorded.com/<br><br>  <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> On Tue, Dec 18, 2018
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>>
>>>
>>> On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:
>>>
>>>> Hi Edwin
>>>>
>>>>
>>>>
>>>> I can’t understand why the pattern is not working or where the spaces
>>>> between the <br> are coming from. It should, however, be possible to allow
>>>> for spaces between the <br> in the second match pattern, i.e.
>>>>
>>>>
>>>>
>>>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>>>
>>>>
>>>>
>>>> /Paul
>>>>
>>>>
>>>>
>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> Windows 10
>>>>
>>>>
>>>>
>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> Sent: Wednesday, 6 March 2019 16:28
>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>
>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> I have tried setting the first match pattern to
>>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:
>>>>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>>    <str name="replacement">&lt;br&gt;</str>
>>>>    <bool name="literalReplacement">true</bool>
>>>> </processor>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">(&lt;br&gt;){3,}</str>
>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>    <bool name="literalReplacement">true</bool>
>>>> </processor>
>>>>
>>>> However, the result is still the same as before (previous index results),
>>>> with the 4 <br>.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>>
>>>> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
>>>>
>>>> > Hi Edwin
>>>> >
>>>> >
>>>> >
>>>> > You are correct re the 2nd pattern – my bad. Looking at the 4 <br>, it’s
>>>> > actually the sequence «<br><br>  <br><br>»? So perhaps the first match
>>>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>> >
>>>> >
>>>> >
>>>> > i.e. [space tab vertical-tab formfeed]
>>>> >
>>>> >
>>>> >
>>>> > Regards,
>>>> >
>>>> > Paul
>>>> >
>>>> >
>>>> >
>>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > Windows 10
>>>> >
>>>> >
>>>> >
>>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > Sent: Wednesday, 6 March 2019 07:44
>>>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> >
>>>> >
>>>> >
>>>> > Hi Paul,
>>>> >
>>>> > I have modified the second pattern to be (&lt;br&gt;){3,} instead of
>>>> > (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
>>>> > actually looks for 6 or more <br> rather than 3, because the group
>>>> > contains <br> twice. That is why there were more <br> in the result:
>>>> > runs of fewer than 6 <br> were not replaced, so we ended up with up to
>>>> > 5 <br> in the index.
>>>> >
>>>> > Modified configuration:
>>>> >  <processor class="solr.RegexReplaceProcessorFactory">
>>>> >    <str name="fieldName">content</str>
>>>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
>>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> >    <bool name="literalReplacement">true</bool>
>>>> >  </processor>
>>>> >
>>>> > This will bring us back to the result of the previous index content,
>>>> > meaning the issue of having the 4 <br> is still there.
>>>> >
>>>> > Regards,
>>>> > Edwin
>>>> >
>>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > Hi Paul,
>>>> > >
>>>> > > Further to my previous email, in which there was an extra "}" in the
>>>> > > configuration, I have changed to the configuration below, based on your
>>>> > > suggestion.
>>>> > >
>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >    <str name="fieldName">content</str>
>>>> > >    <str name="pattern">[ \t]*\r?\n</str>
>>>> > >    <str name="replacement">&lt;br&gt;</str>
>>>> > >    <bool name="literalReplacement">true</bool>
>>>> > > </processor>
>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >    <str name="fieldName">content</str>
>>>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >    <bool name="literalReplacement">true</bool>
>>>> > > </processor>
>>>> > >
>>>> > > However, the result that I get still has more than 2 <br>. In fact, the
>>>> > > result has become worse, as you can see from the comparison below.
>>>> > >
>>>> > > Example 1: A sentence for which the regex pattern used to work correctly.
>>>> > > With the latest pattern, the output has changed from 2 <br> to 5 <br>,
>>>> > > which is wrong.
>>>> > > *Original content in EML file:*
>>>> > > Dear Sir,
>>>> > >
>>>> > >
>>>> > > I am terminating
>>>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
>>>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>>>> > >
>>>> > > Example 2: A sentence for which the above regex pattern only partially
>>>> > > works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > > *Original content in EML file:*
>>>> > >
>>>> > > *exalted*
>>>> > >
>>>> > > *Psalm 89:17*
>>>> > >
>>>> > >
>>>> > > 3 Choa Chu Kang Avenue 4
>>>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>>> > > Chu Kang Avenue 4, Singapore
>>>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>
>>>> > > <br><br>  3 Choa Chu Kang Avenue 4, Singapore
>>>> > >
>>>> > > Example 3: A sentence for which the above regex pattern only partially
>>>> > > works (as you can see, instead of 2 <br>, there are 4 <br>). With the
>>>> > > latest configuration, there are now 5 <br>
>>>> > > *Original content in EML file:*
>>>> > >
>>>> > > http://www.concorded.com/
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>>>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
>>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
>>>> > > 10:07 AM
>>>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
>>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> > > *Current Index content:* http://www.concorded.com/<br><br> <br><br><br>
>>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >
>>>> > >
>>>> > > Regards,
>>>> > > Edwin
>>>> > >
>>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hi Paul,
>>>> > >>
>>>> > >> Thank you for the reply.
>>>> > >>
>>>> > >> I have tried to add the following configuration according to your
>>>> > >> suggestion:
>>>> > >>
>>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>    <str name="fieldName">content</str>
>>>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
>>>> > >>    <str name="replacement">&lt;br&gt;</str>
>>>> > >>    <bool name="literalReplacement">true</bool>
>>>> > >> </processor>
>>>> > >>
>>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>    <str name="fieldName">content</str>
>>>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>    <bool name="literalReplacement">true</bool>
>>>> > >> </processor>
>>>> > >>
>>>> > >> However, none of the \n is being removed this time round.
>>>> > >> Is the order and/or the pattern correct?
>>>> > >>
>>>> > >> Regards,
>>>> > >> Edwin
>>>> > >>
>>>> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>>>> > >>
>>>> > >>> Hi Edwin
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Try for the first pattern/replacement
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> <str name="pattern">[ \t]*\r?\n</str>
>>>> > >>>
>>>> > >>> <str name="replacement">&lt;br&gt;</str>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Now all line endings and preceding whitespace characters should be
>>>> > >>> changed to ‘<br>’.
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> The second pattern/replacement should replace 3 or more ‘<br>’ sequences
>>>> > >>> with 2 ‘<br>’ sequences:
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>> > >>>
>>>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Hope this approach works. Sorry for not replying earlier and best
>>>> > >>> regards,
>>>> > >>>
>>>> > >>> Paul
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > >>> Windows 10
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > >>> Sent: Tuesday, 5 March 2019 03:35
>>>> > >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> Hi,
>>>> > >>>
>>>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>>> > >>>
>>>> > >>> Regards,
>>>> > >>> Edwin
>>>> > >>>
>>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > >>> wrote:
>>>> > >>>
>>>> > >>> > Hi,
>>>> > >>> >
>>>> > >>> > Does anyone else have other suggestions, or has anyone faced the same
>>>> > >>> > problem?
>>>> > >>> >
>>>> > >>> > Regards,
>>>> > >>> > Edwin
>>>> > >>> >
>>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > >>> > wrote:
>>>> > >>> >
>>>> > >>> >> Hi Paul,
>>>> > >>> >>
>>>> > >>> >> If I execute the second step first, I only get a single <br> in
>>>> > >>> >> the cases that had 2 <br>. In the cases where we originally got
>>>> > >>> >> 4 <br>, there are now 2 <br> with a space in between.
>>>> > >>> >>
>>>> > >>> >> This just changes the 2 <br> to a single <br>, since the second
>>>> > >>> >> step replaces with a single <br>.
>>>> > >>> >> But it has not solved the underlying problem yet.
>>>> > >>> >>
>>>> > >>> >> Regards,
>>>> > >>> >> Edwin
>>>> > >>> >>
>>>> > >>> >>
>>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>>> > >>> >>
>>>> > >>> >>> If the second step is executed first, then you will get the unwanted 4
>>>> > >>> >>> <br>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > >>> >>> Windows 10
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
>>>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>>
>>>> > >>> >>> Hi Jörn ,
>>>> > >>> >>>
>>>> > >>> >>> Do you mean the regex is not correct?
>>>> > >>> >>>
>>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps, like the ones
>>>> > >>> >>> shown below. The output that we get is still the same.
>>>> > >>> >>>
>>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>>      <str name="fieldName">content</str>
>>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>>>> > >>> >>> </processor>
>>>> > >>> >>>
>>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>>      <str name="fieldName">content</str>
>>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
>>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>>>> > >>> >>> </processor>
>>>> > >>> >>>
>>>> > >>> >>> Regards,
>>>> > >>> >>> Edwin
>>>> > >>> >>>
>>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> > >>> >>>
>>>> > >>> >>> > Then you need two RegexReplaceProcessorFactory steps
>>>> > >>> >>> >
>>>> > >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>> > >>> >>> > >
>>>> > >>> >>> > > Hi,
>>>> > >>> >>> > >
>>>> > >>> >>> > > Thanks for the reply.
>>>> > >>> >>> > >
>>>> > >>> >>> > > Do you know of any regex online tool that works correctly for
>>>> > >>> >>> > > Java regex? I tried to find some, but they are not working properly.
>>>> > >>> >>> > >
>>>> > >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
>>>> > >>> >>> > > single \n with a single <br>.
>>>> > >>> >>> > >
>>>> > >>> >>> > > Regards,
>>>> > >>> >>> > > Edwin
>>>> > >>> >>> > >
>>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> > >>> >>> > >>
>>>> > >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug - it would then
>>>> > >>> >>> > >> be in the JDK. Try it out in an online regex tool that supports Java
>>>> > >>> >>> > >> regex for your solution.
>>>> > >>> >>> > >>
>>>> > >>> >>> > >> I believe you want to have 2 regex process factories:
>>>> > >>> >>> > >> One that deals with a single \n and one that deals with more than one \n
>>>> > >>> >>> > >>
>>>> > >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> Hi,
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>>> > >>> >>> > >>> configuration:
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
>>>> > >>> >>> > >>> </processor>
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> However, the issue is still occurring.
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> Is anyone else able to help?
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> Regards,
>>>> > >>> >>> > >>> Edwin
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > >>> >>> > >>> wrote:
>>>> > >>> >>> > >>>
>>>> > >>> >>> > >>>> Hi,
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>> Regards,
>>>> > >>> >>> > >>>> Edwin
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>> wrote:
>>>> > >>> >>> > >>>>
>>>> > >>> >>> > >>>>> Hi,
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>> Regards,
>>>> > >>> >>> > >>>>> Edwin
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>> wrote:
>>>> > >>> >>> > >>>>>
>>>> > >>> >>> > >>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>>> > >>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all the
>>>> > >>> >>> > >>>>>> examples (i.e. all of them have only <br><br>, and not more than that
>>>> > >>> >>> > >>>>>> as we are getting in Solr in our earlier examples).
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> Regards,
>>>> > >>> >>> > >>>>>> Edwin
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>> wrote:
>>>> > >>> >>> > >>>>>>
>>>> > >>> >>> > >>>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>>> > >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> However, we are also getting the exact same results as the earlier
>>>> > >>> >>> > >>>>>>> Example 1, 2 and 3.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> As for your point 2, that perhaps the data has other (non-printing)
>>>> > >>> >>> > >>>>>>> characters than \n, we have found that there are no non-printing
>>>> > >>> >>> > >>>>>>> characters. It is just a new line with a space. You can refer to the
>>>> > >>> >>> > >>>>>>> original content in the same examples below.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Example 1: A sentence for which the above regex pattern works correctly
>>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>>> > >>> >>> > >>>>>>> Dear Sir,
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> I am terminating
>>>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Example 2: A sentence for which the above regex pattern only partially
>>>> > >>> >>> > >>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> *exalted*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> *Psalm 89:17*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>>> > >>> >>> > >>>>>>> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Example 3: A sentence for which the above regex pattern only partially
>>>> > >>> >>> > >>>>>>> works (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Thank you.
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>> Regards,
>>>> > >>> >>> > >>>>>>> Edwin
>>>> > >>> >>> > >>>>>>>
>>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch> wrote:
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Hi Edwin
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>> > >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
>>>> > >>> >>> > >>>>>>>> than \n?
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > >>> >>> > >>>>>>>> Windows 10
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>>> > >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> We have tried the suggested regex pattern as follows:
>>>> > >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
>>>> > >>> >>> > >>>>>>>> below.
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> working
>>>> > >>> >>> > >>>>>>>> correctly
>>>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>>> > >>> >>> terminating
>>>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>>> terminating
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>>> working
>>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>>> <br>)
>>>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>>> >  \n\n
>>>> > >>> >>>  \n\n
>>>> > >>> >>> > 3
>>>> > >>> >>> > >>>>>>>> Choa
>>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>>>  <br><br>
>>>> > >>> >>> > <br><br>3
>>>> > >>> >>> > >>>>>>>> Choa
>>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>>> working
>>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>>> <br>)
>>>> > >>> >>> > >>>>>>>> *Original content:*
>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>>  \n\n
>>>> > >>> >>> >  \n\n
>>>> > >>> >>> > >>>>>>>> \n \n\n
>>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>>> \n\n\n
>>>> > On
>>>> > >>> >>> Tue, Dec
>>>> > >>> >>> > >> 18,
>>>> > >>> >>> > >>>>>>>> 2018
>>>> > >>> >>> > >>>>>>>> at 10:07 AM
>>>> > >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>> > >>>  <br><br>
>>>> > >>> >>> > >>>>>>>> <br><br>On
>>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Any further suggestion?
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Thank you.
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>> Regards,
>>>> > >>> >>> > >>>>>>>> Edwin
>>>> > >>> >>> > >>>>>>>>
>>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch> wrote:
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>>>> > >>> >>> > >>>>>>>>> {2,} part, you could try
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>>> > >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Hi Paul,
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Thanks for your reply.
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> When I use this pattern:
>>>> > >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>>> > >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> It works for some sentences within the same content but not for
>>>> > >>> >>> > >>>>>>>>> others. Please see below for one example that works and two that
>>>> > >>> >>> > >>>>>>>>> only partially work:
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> working
>>>> > >>> >>> > >>>>>>>> correctly
>>>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I
>>>> am
>>>> > >>> >>> terminating
>>>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>>> > terminating
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>>> working
>>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>>> <br>)
>>>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>>> >  \n\n
>>>> > >>> >>> >  \n\n  3
>>>> > >>> >>> > >>>>>>>> Choa
>>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>>> >  <br><br>
>>>> > >>> >>> > <br><br>3
>>>> > >>> >>> > >>>>>>>> Choa
>>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex
>>>> pattern is
>>>> > >>> >>> partially
>>>> > >>> >>> > >>>>>>>> working
>>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>>> <br>)
>>>> > >>> >>> > >>>>>>>>> *Original content:*
>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>>  \n\n
>>>> > >>> >>> > >> \n\n
>>>> > >>> >>> > >>>>>>>> \n
>>>> > >>> >>> > >>>>>>>>> \n\n
>>>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>>> \n\n\n
>>>> > On
>>>> > >>> >>> Tue,
>>>> > >>> >>> > Dec
>>>> > >>> >>> > >>>>>>>> 18, 2018
>>>> > >>> >>> > >>>>>>>>> at 10:07 AM
>>>> > >>> >>> > >>>>>>>>> *Index content: *
>>>> http://www.concordpri.moe.edu.sg/
>>>> > >>> >>>  <br><br>
>>>> > >>> >>> > >>>>>>>> <br><br>On
>>>> > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> We would appreciate your help in working out what is wrong.
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Thank you.
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>> Regards,
>>>> > >>> >>> > >>>>>>>>> Edwin
>>>> > >>> >>> > >>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch> wrote:
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>>>> > >>> >>> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> ??
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>>>> > >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>>> > >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>> > >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Hi,
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two or
>>>> > >>> >>> > >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>>> > >>> >>> > >>>>>>>>>> \n \n  \n \n), and replace them with two <br>.
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
>>>> > >>> >>> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
>>>> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>> > >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>>> > >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > >>> >>> > >>>>>>>>>> </processor>
>>>> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> To explain my regex pattern further: \s* tells the regex to match
>>>> > >>> >>> > >>>>>>>>>> any whitespace that follows a \n, and {2,} tells it to match 2 or
>>>> > >>> >>> > >>>>>>>>>> more occurrences of that group (\n plus trailing whitespace).
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>>>> > >>> >>> > >>>>>>>>>>
>>>> > >>> >>> > >>>>>>>>>> Regards,
>>>> > >>> >>> > >>>>>>>>>> Edwin

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

Has anyone else faced the same issue before?
So far all the regex patterns that we tried in this thread are not able to
resolve the issue.

Regards,
Edwin
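
One way to narrow this down is to run the pattern in plain Java, outside
Solr. The sketch below is only an illustration under assumptions: the class
name, the helper method and the sample strings are hypothetical, and the idea
that the content contains a non-breaking space (U+00A0) is a guess that has
not been confirmed anywhere in this thread. By default, Java's \s matches only
space, \t, \n, \x0B, \f and \r, so a character such as U+00A0 sitting between
two groups of \n would split one run of newlines into two matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewlineCollapseCheck {

    // Pattern quoted earlier in this thread: two or more newlines,
    // each followed by optional whitespace.
    static final Pattern THREAD_PATTERN = Pattern.compile("(\\n\\s*){2,}");

    // Hypothetical variant (not from the thread): additionally treats
    // U+00A0 (non-breaking space) as whitespace between the newlines.
    static final Pattern NBSP_AWARE_PATTERN =
            Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    // Replaces every match with a literal "<br><br>".
    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 as it was typed into the thread (plain spaces only).
        String asPosted =
                "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
        // Assumed variant: a non-breaking space between the newline groups.
        String withNbsp =
                "Psalm 89:17   \n\n\u00A0\n\n  3 Choa Chu Kang Avenue 4";

        // Plain spaces: each whitespace run collapses to a single <br><br>.
        System.out.println(collapse(THREAD_PATTERN, asPosted));
        // With U+00A0: the run is split, giving <br><br>, U+00A0, <br><br>.
        System.out.println(collapse(THREAD_PATTERN, withNbsp));
        // NBSP-aware variant: the run collapses to a single <br><br> again.
        System.out.println(collapse(NBSP_AWARE_PATTERN, withNbsp));
    }
}
```

If the text as posted collapses correctly here but the indexed field does not,
that would suggest the field value contains characters other than the ones
shown in the examples. Dumping the code points of the stored value would be
one way to check.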

On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> Sorry, I realized there is an extra ']' in the pattern provided, which is
> why there are so many <br> in the output.
>
> The output is exactly the same as previously (previous index result) if we
> remove the extra ']', as shown in the configuration below.
>
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
>  </processor>
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
>  </processor>
>
> Regards,
> Edwin
>
>
>
> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Thanks for the reply.
>>
>> For the 2nd pattern, if we put this pattern <str
>> name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, which is like the
>> configurations below:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> This does not reduce the runs of 3 or more <br> to 2 <br>.
>>
>> We will end up with many <br> in the output, like the example below:
>>
>>  http://www.concorded.com/<br><br>  <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> On Tue, Dec 18, 2018
>>
>>
>> Regards,
>> Edwin
>>
>>
>>
>>
>> On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:
>>
>>> Hi Edwin
>>>
>>>
>>>
>>> I can’t understand why the pattern is not working or where the spaces
>>> between the <br> are coming from. It should, however, be possible to allow
>>> for spaces between the <br> in the second match pattern, i.e.
>>>
>>>
>>>
>>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>>
>>>
>>>
>>> /Paul
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> Sent: Wednesday, 6 March 2019 16:28
>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi Paul,
>>>
>>> I have tried setting the first match pattern to
>>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>>    <str name="replacement">&lt;br&gt;</str>
>>>    <bool name="literalReplacement">true</bool>
>>> </processor>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">(&lt;br&gt;){3,}</str>
>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>    <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> However, the result is still the same as before (previous index results),
>>> with the 4 <br>.
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
>>>
>>> > Hi Edwin
>>> >
>>> >
>>> >
>>> > You are correct re the 2nd pattern – my bad. Looking at the 4 <br>, it’s
>>> > actually the sequence «<br><br>  <br><br>»? So perhaps the first match
>>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>> >
>>> >
>>> >
>>> > i.e. [space tab vertical-tab formfeed]
>>> >
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Paul
>>> >
>>> >
>>> >
>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > Sent: Wednesday, 6 March 2019 07:44
>>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >
>>> >
>>> >
>>> > Hi Paul,
>>> >
>>> > I have changed the second pattern to (&lt;br&gt;){3,} instead of
>>> > (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
>>> > actually looks for 6 or more <br> rather than 3, because the <br>
>>> > appears twice inside the group. That is why there were more <br> in
>>> > the result: runs of fewer than 6 <br> were not replaced, so we ended
>>> > up with up to 5 <br> in the index.
>>> >
>>> > Modified configuration:
>>> >  <processor class="solr.RegexReplaceProcessorFactory">
>>> >    <str name="fieldName">content</str>
>>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >    <bool name="literalReplacement">true</bool>
>>> >  </processor>
>>> >
>>> > This will bring us back to the result of the previous index content,
>>> > meaning the issue of having the 4 <br> is still there.
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> >
>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > wrote:
>>> >
>>> > > Hi Paul,
>>> > >
>>> > > Further to my previous email, in which there was an extra "}" in the
>>> > > configuration, I have changed to the configuration below, based on
>>> > > your suggestion.
>>> > >
>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>> > >    <str name="fieldName">content</str>
>>> > >    <str name="pattern">[ \t]*\r?\n</str>
>>> > >    <str name="replacement">&lt;br&gt;</str>
>>> > >    <bool name="literalReplacement">true</bool>
>>> > > </processor>
>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>> > >    <str name="fieldName">content</str>
>>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >    <bool name="literalReplacement">true</bool>
>>> > > </processor>
>>> > >
>>> > > However, the result that I get still has more than 2 <br>. In fact, the
>>> > > result has become worse, as you can see from the comparison below.
>>> > >
>>> > > Example 1: The sentence for which the regex pattern used to work
>>> > > correctly. With the latest pattern, it has changed from 2 <br> to 5 <br>,
>>> > > which is wrong.
>>> > > *Original content in EML file:*
>>> > > Dear Sir,
>>> > >
>>> > >
>>> > > I am terminating
>>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
>>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am
>>> > terminating
>>> > >
>>> > > Example 2: The sentence for which the above regex pattern is partially
>>> > > working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > > *Original content in EML file:*
>>> > >
>>> > > *exalted*
>>> > >
>>> > > *Psalm 89:17*
>>> > >
>>> > >
>>> > > 3 Choa Chu Kang Avenue 4
>>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>> Choa
>>> > > Chu Kang Avenue 4, Singapore
>>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>
>>> <br><br>  3
>>> > > Choa Chu Kang Avenue 4, Singapore
>>> > >
>>> > > Example 3: The sentence for which the above regex pattern is partially
>>> > > working (as you can see, instead of 2 <br>, there are 4 <br>). With the
>>> > > latest configuration, there are now 5 <br>.
>>> > > *Original content in EML file:*
>>> > >
>>> > > http://www.concorded.com/
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n
>>> \n\n
>>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>> 2018 at
>>> > > 10:07 AM
>>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
>>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > > *Current Index content:* http://www.concorded.com/<br><br>
>>> <br><br><br>
>>> > > On Tue, Dec 18, 2018 at 10:07 AM
>>> > >
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Paul,
>>> > >>
>>> > >> Thank you for the reply.
>>> > >>
>>> > >> I have tried to add the following configuration according to your
>>> > >> suggestion:
>>> > >>
>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>    <str name="fieldName">content</str>
>>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
>>> > >>    <str name="replacement">&lt;br&gt;</str>
>>> > >>    <bool name="literalReplacement">true</bool>
>>> > >> </processor>
>>> > >>
>>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>    <str name="fieldName">content</str>
>>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>    <bool name="literalReplacement">true</bool>
>>> > >> </processor>
>>> > >>
>>> > >> However, none of the \n is being removed this time round.
>>> > >> Is the order and/or the pattern correct?
>>> > >>
>>> > >> Regards,
>>> > >> Edwin
>>> > >>
>>> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>>> > >>
>>> > >>> Hi Edwin
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Try for the first pattern/replacement
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> <str name="pattern">[ \t]*\r?\n</str>
>>> > >>>
>>> > >>> <str name="replacement">&lt;br&gt;</str>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Now all line endings and preceding whitespace characters should be
>>> > >>> changed to ‘<br>’.
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> The second pattern/replacement should replace 3 or more ‘<br>’ sequences
>>> > >>> with 2 ‘<br>’ sequences:
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>> > >>>
>>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Hope this approach works. Sorry for not replying earlier and best
>>> > >>> regards,
>>> > >>>
>>> > >>> Paul
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>> Sent: Tuesday, 5 March 2019 03:35
>>> > >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Hi,
>>> > >>>
>>> > >>> For your info, this issue also occurs in the new Solr 7.7.1.
>>> > >>>
>>> > >>> Regards,
>>> > >>> Edwin
>>> > >>>
>>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > >>> wrote:
>>> > >>>
>>> > >>> > Hi,
>>> > >>> >
>>> > >>> > Does anyone else have other suggestions, or has anyone faced the same problem?
>>> > >>> >
>>> > >>> > Regards,
>>> > >>> > Edwin
>>> > >>> >
>>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > >>> > wrote:
>>> > >>> >
>>> > >>> >> Hi Paul,
>>> > >>> >>
>>> > >>> >> If I execute the second step first, then I only get a single <br>
>>> > >>> >> where there were 2 <br> before.
>>> > >>> >> Where we originally got 4 <br>, there are now 2 <br> with a space
>>> > >>> >> in between.
>>> > >>> >>
>>> > >>> >> This just changes the 2 <br> into a single <br>, since the second
>>> > >>> >> step replaces with a single <br>.
>>> > >>> >> It has not solved the underlying problem.
>>> > >>> >>
>>> > >>> >> Regards,
>>> > >>> >> Edwin
>>> > >>> >>
>>> > >>> >>
>>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>> > >>> >>
>>> > >>> >>> If the second step is executed first, then you will get the unwanted 4
>>> > >>> >>> <br>
>>> > >>> >>>
>>> > >>> >>>
>>> > >>> >>>
>>> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
>>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>> >>>
>>> > >>> >>>
>>> > >>> >>>
>>> > >>> >>> Hi Jörn ,
>>> > >>> >>>
>>> > >>> >>> Do you mean the regex is not correct?
>>> > >>> >>>
>>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
>>> > >>> >>> ones shown below. The output that we get is still the same.
>>> > >>> >>>
>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>>      <str name="fieldName">content</str>
>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>>> > >>> >>> </processor>
>>> > >>> >>>
>>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>>      <str name="fieldName">content</str>
>>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
>>> > >>> >>>      <bool name="literalReplacement">true</bool>
>>> > >>> >>> </processor>
>>> > >>> >>>
>>> > >>> >>> Regards,
>>> > >>> >>> Edwin
>>> > >>> >>>
>>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>>> > >>> >>>
>>> > >>> >>> > Then you need two regexprocessfactory steps
>>> > >>> >>> >
>>> > >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> > >>> >>> > >
>>> > >>> >>> > > Hi,
>>> > >>> >>> > >
>>> > >>> >>> > > Thanks for the reply.
>>> > >>> >>> > >
>>> > >>> >>> > > Do you know of any online regex tool that works correctly for Java regex?
>>> > >>> >>> > > I tried to find some, but they do not work properly.
>>> > >>> >>> > >
>>> > >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n
>>> > >>> >>> > > with a single <br>.
>>> > >>> >>> > >
>>> > >>> >>> > > Regards,
>>> > >>> >>> > > Edwin
>>> > >>> >>> > >
>>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>>> > >>> >>> > >>
>>> > >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would then
>>> > >>> >>> > >> be in the JDK. Try your solution out in an online regex tool that
>>> > >>> >>> > >> supports Java regex.
>>> > >>> >>> > >>
>>> > >>> >>> > >> I believe you want to have 2 regex processor factories:
>>> > >>> >>> > >> one that deals with a single \n and one that deals with more than one \n
>>> > >>> >>> > >>
>>> > >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> Hi,
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, and
>>> > >>> >>> > >>> configuration:
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>> > >>>  <str name="fieldName">content</str>
>>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
>>> > >>> >>> > >>> </processor>
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> However, the issue is still occurring.
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> Is anyone else able to help?
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> Regards,
>>> > >>> >>> > >>> Edwin
>>> > >>> >>> > >>>
>>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > >>> >>> > >>> wrote:
>>> > >>> >>> > >>>
>>> > >>> >>> > >>>> Hi,
>>> > >>> >>> > >>>>
>>> > >>> >>> > >>>> For your info, this issue occurs in Solr 7.7.0 as well.
>>> > >>> >>> > >>>>
>>> > >>> >>> > >>>> Regards,
>>> > >>> >>> > >>>> Edwin
>>> > >>> >>> > >>>>
>>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > >>> >>> > >>>> wrote:
>>> > >>> >>> > >>>>
>>> > >>> >>> > >>>>> Hi,
>>> > >>> >>> > >>>>>
>>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
>>> > >>> >>> > >>>>>
>>> > >>> >>> > >>>>> Regards,
>>> > >>> >>> > >>>>> Edwin
>>> > >>> >>> > >>>>>
>>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > >>> >>> > >>>>> wrote:
>>> > >>> >>> > >>>>>
>>> > >>> >>> > >>>>>> Hi Paul,
>>> > >>> >>> > >>>>>>
>>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on
>>> > >>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all the
>>> > >>> >>> > >>>>>> examples (i.e. all of them have only <br><br>, and not more than that,
>>> > >>> >>> > >>>>>> unlike what we are getting in Solr in our earlier examples).
>>> > >>> >>> > >>>>>>
>>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> > >>> >>> > >>>>>>
>>> > >>> >>> > >>>>>> Regards,
>>> > >>> >>> > >>>>>> Edwin
>>> > >>> >>> > >>>>>>
>>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>>> > >>> >>> > >>>>>> wrote:
>>> > >>> >>> > >>>>>>
>>> > >>> >>> > >>>>>>> Hi Paul,
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
>>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>> > >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>> >>> > >>>>>>> </processor>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> However, we are getting exactly the same results as in the earlier
>>> > >>> >>> > >>>>>>> Examples 1, 2 and 3.
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
>>> > >>> >>> > >>>>>>> characters than \n: we have found that there are no non-printing
>>> > >>> >>> > >>>>>>> characters. It is just a newline with a space. You can refer to the
>>> > >>> >>> > >>>>>>> original content in the same examples below.
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern
>>> is
>>> > >>> working
>>> > >>> >>> > >>>>>>> correctly
>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>> > >>> >>> > >>>>>>> Dear Sir,
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> I am terminating
>>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> > >>> terminating
>>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>> terminating
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern
>>> is
>>> > >>> >>> partially
>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>>> are 4
>>> > >>> <br>)
>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> *exalted*
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> *Psalm 89:17*
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>>  \n\n
>>> > >>> >>>  \n\n  3
>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>>  <br><br>
>>> > >>> >>> <br><br>3
>>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern
>>> is
>>> > >>> >>> partially
>>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there
>>> are 4
>>> > >>> <br>)
>>> > >>> >>> > >>>>>>> *Original content in EML file:*
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>> >>> > >>>>>>> *Original content:*
>>> http://www.concordpri.moe.edu.sg/
>>> > >>>  \n\n
>>> > >>> >>> >  \n\n
>>> > >>> >>> > >> \n
>>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>> > >>> \n\n\n  On
>>> > >>> >>> Tue,
>>> > >>> >>> > >> Dec 18,
>>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
>>> > >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>> > >>>  <br><br>
>>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you
>>> may
>>> > >>> have.
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> Thank you.
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>> Regards,
>>> > >>> >>> > >>>>>>> Edwin
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.dodd@ub.unibe.ch
>>> >
>>> > >>> wrote:
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Hi Edwin
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should
>>> > preceed
>>> > >>> >>> the \n
>>> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>> > >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non
>>> printing)
>>> > >>> >>> characters
>>> > >>> >>> > >>>>>>>> than \n?
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Gesendet von Mail<
>>> > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986>
>>> > >>> >>> > >> für
>>> > >>> >>> > >>>>>>>> Windows 10
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:
>>> edwinyeozl@gmail.com>
>>> > >>> >>> > >>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:23
>>> > >>> >>> > >>>>>>>> An: solr-user@lucene.apache.org<mailto:
>>> > >>> >>> > solr-user@lucene.apache.org>
>>> > >>> >>> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to
>>> > >>> detect
>>> > >>> >>> > >> multiple \n
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Hi Paul,
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as
>>> follow:
>>> > >>> >>> > >>>>>>>> <processor
>>> class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>> > >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>> >>> > >>>>>>>> </processor>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of
>>> Example
>>> > 1,2
>>> > >>> and
>>> > >>> >>> 3
>>> > >>> >>> > >> below.
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex
>>> pattern is
>>> > >>> >>> working
>>> > >>> >>> > >>>>>>>> correctly
>>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> > >>> >>> terminating
>>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>> terminating
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex
>>> pattern is
>>> > >>> >>> partially
>>> > >>> >>> > >>>>>>>> working
>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>> >  \n\n
>>> > >>> >>>  \n\n
>>> > >>> >>> > 3
>>> > >>> >>> > >>>>>>>> Choa
>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>>  <br><br>
>>> > >>> >>> > <br><br>3
>>> > >>> >>> > >>>>>>>> Choa
>>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex
>>> pattern is
>>> > >>> >>> partially
>>> > >>> >>> > >>>>>>>> working
>>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> > >>> >>> > >>>>>>>> *Original content:*
>>> http://www.concordpri.moe.edu.sg/
>>> > >>>  \n\n
>>> > >>> >>> >  \n\n
>>> > >>> >>> > >>>>>>>> \n \n\n
>>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>> \n\n\n
>>> > On
>>> > >>> >>> Tue, Dec
>>> > >>> >>> > >> 18,
>>> > >>> >>> > >>>>>>>> 2018
>>> > >>> >>> > >>>>>>>> at 10:07 AM
>>> > >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>> > >>>  <br><br>
>>> > >>> >>> > >>>>>>>> <br><br>On
>>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Any further suggestion?
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Thank you.
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>> Regards,
>>> > >>> >>> > >>>>>>>> Edwin
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <
>>> paul.dodd@ub.unibe.ch>
>>> > >>> wrote:
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
>>> > >>> failing
>>> > >>> >>> on
>>> > >>> >>> > the
>>> > >>> >>> > >>>>>>>> {2,}
>>> > >>> >>> > >>>>>>>>> part you could try
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Gesendet von Mail<
>>> > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986
>>> > >>> >>> > >
>>> > >>> >>> > >>>>>>>> für
>>> > >>> >>> > >>>>>>>>> Windows 10
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:
>>> edwinyeozl@gmail.com>
>>> > >>> >>> > >>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:10
>>> > >>> >>> > >>>>>>>>> An: solr-user@lucene.apache.org<mailto:
>>> > >>> >>> > solr-user@lucene.apache.org
>>> > >>> >>> > >>>
>>> > >>> >>> > >>>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern
>>> to
>>> > >>> detect
>>> > >>> >>> > >> multiple
>>> > >>> >>> > >>>>>>>> \n
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Hi Paul,
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Thanks for your reply.
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> When I use this pattern:
>>> > >>> >>> > >>>>>>>>> <processor
>>> class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>> > >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>> >>> > >>>>>>>>> </processor>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> It is working for some sentence within the same
>>> content
>>> > >>> and
>>> > >>> >>> not
>>> > >>> >>> > >>>>>>>> working for
>>> > >>> >>> > >>>>>>>>> some sentences. Please see below for the one that
>>> is
>>> > >>> working
>>> > >>> >>> and
>>> > >>> >>> > >>>>>>>> another
>>> > >>> >>> > >>>>>>>>> that is not working (partially working):
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex
>>> pattern is
>>> > >>> >>> working
>>> > >>> >>> > >>>>>>>> correctly
>>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> > >>> >>> terminating
>>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>>> > terminating
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex
>>> pattern is
>>> > >>> >>> partially
>>> > >>> >>> > >>>>>>>> working
>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>> >  \n\n
>>> > >>> >>> >  \n\n  3
>>> > >>> >>> > >>>>>>>> Choa
>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>> >  <br><br>
>>> > >>> >>> > <br><br>3
>>> > >>> >>> > >>>>>>>> Choa
>>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex
>>> pattern is
>>> > >>> >>> partially
>>> > >>> >>> > >>>>>>>> working
>>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> > >>> >>> > >>>>>>>>> *Original content:*
>>> http://www.concordpri.moe.edu.sg/
>>> > >>>  \n\n
>>> > >>> >>> > >> \n\n
>>> > >>> >>> > >>>>>>>> \n
>>> > >>> >>> > >>>>>>>>> \n\n
>>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>> \n\n\n
>>> > On
>>> > >>> >>> Tue,
>>> > >>> >>> > Dec
>>> > >>> >>> > >>>>>>>> 18, 2018
>>> > >>> >>> > >>>>>>>>> at 10:07 AM
>>> > >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>> > >>> >>>  <br><br>
>>> > >>> >>> > >>>>>>>> <br><br>On
>>> > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Thank you.
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>> Regards,
>>> > >>> >>> > >>>>>>>>> Edwin
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <
>>> paul.dodd@ub.unibe.ch>
>>> > >>> wrote:
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not
>>> > >>> working. I
>>> > >>> >>> > assume
>>> > >>> >>> > >>>>>>>> nothing
>>> > >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> ??
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> Gesendet von Mail<
>>> > >>> >>> > https://go.microsoft.com/fwlink/?LinkId=550986>
>>> > >>> >>> > >>>>>>>> für
>>> > >>> >>> > >>>>>>>>>> Windows 10
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:
>>> edwinyeozl@gmail.com>
>>> > >>> >>> > >>>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 14:08
>>> > >>> >>> > >>>>>>>>>> An: solr-user@lucene.apache.org<mailto:
>>> > >>> >>> > >> solr-user@lucene.apache.org
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> Betreff: RegexReplaceProcessorFactory pattern to
>>> > detect
>>> > >>> >>> multiple
>>> > >>> >>> > >> \n
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> Hi,
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> I am trying to use the
>>> RegexReplaceProcessorFactory to
>>> > >>> >>> remove
>>> > >>> >>> > more
>>> > >>> >>> > >>>>>>>> than
>>> > >>> >>> > >>>>>>>>> two
>>> > >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg:
>>> \n\n,
>>> > \n
>>> > >>> \n,
>>> > >>> >>> \n
>>> > >>> >>> > \n
>>> > >>> >>> > >>>>>>>> \n
>>> > >>> >>> > >>>>>>>>> \n),
>>> > >>> >>> > >>>>>>>>>> and replace it with two <br>.
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> I use the following regex pattern and it is
>>> working
>>> > >>> when I
>>> > >>> >>> test
>>> > >>> >>> > it
>>> > >>> >>> > >>>>>>>> in
>>> > >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it
>>> > >>> inside
>>> > >>> >>> the
>>> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> > >>> >>> > >>>>>>>>>> <processor
>>> class="solr.RegexReplaceProcessorFactory">
>>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>> > >>> >>> > >>>>>>>>>>  <str
>>> name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>> >>> > >>>>>>>>>> </processor>
>>> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
>>> > >>> >>> instructing
>>> > >>> >>> > the
>>> > >>> >>> > >>>>>>>> regex
>>> > >>> >>> > >>>>>>>>> to
>>> > >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
>>> > >>> instructing
>>> > >>> >>> the
>>> > >>> >>> > >>>>>>>> regex to
>>> > >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how
>>> should
>>> > >>> I do
>>> > >>> >>> it?
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>> Regards,
>>> > >>> >>> > >>>>>>>>>> Edwin
>>> > >>> >>> > >>>>>>>>>>
>>> > >>> >>> > >>>>>>>>>
>>> > >>> >>> > >>>>>>>>
>>> > >>> >>> > >>>>>>>
>>> > >>> >>> > >>
>>> > >>> >>> >
>>> > >>> >>>
>>> > >>> >>
>>> > >>>
>>> > >>
>>> >
>>>
>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Sorry, I realized there is an extra ']' in the pattern provided, which is
why there are so many <br> in the output.
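
For anyone following along, here is a minimal Java sketch (Solr uses
java.util.regex) of what the extra ']' does. The class and method names are
made up for illustration, and the input strings are simplified stand-ins, not
our real content. With the stray ']', the character class matches exactly one
whitespace character and "]*" matches literal ']' characters, so a run of
<br> with nothing between them never matches:

```java
import java.util.regex.Pattern;

final class StrayBracketDemo {
    // Pattern with the stray ']': "[ \t\x0b\f]" must match exactly ONE
    // whitespace character, and the following "]*" matches zero or more
    // literal ']' characters.
    static final Pattern WITH_STRAY =
            Pattern.compile("(<br>[ \\t\\x0b\\f]]*){3,}");

    // Corrected pattern: the '*' now applies to the character class.
    static final Pattern CORRECTED =
            Pattern.compile("(<br>[ \\t\\x0b\\f]*){3,}");

    // Replaces every match of the given pattern with two <br>.
    static String collapse(Pattern pattern, String content) {
        return pattern.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String input = "a<br><br><br><br>b"; // simplified, hypothetical input
        System.out.println(collapse(WITH_STRAY, input)); // left unchanged
        System.out.println(collapse(CORRECTED, input));  // collapsed to two <br>
    }
}
```

Note that the pattern with the stray ']' only matches when every <br> is
followed by exactly one whitespace character.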

With the extra ']' removed, as in the configuration below, the output is
exactly the same as before (the previous index result).

 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
 </processor>
 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
 </processor>
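
To check this outside Solr, the two steps can be approximated in plain Java
as below. This is only a sketch: it calls java.util.regex directly instead of
going through RegexReplaceProcessorFactory, the class and method names are
hypothetical, and the U+00A0 (non-breaking space) in the second input is
purely an assumption about what might sit between the <br> pairs, not
something we have confirmed in our data. The \h variant is likewise just an
idea to try, not a verified fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepChainSketch {
    // Whitespace class exactly as in the configuration above.
    static final String POSTED_WS = "[ \\t\\x0b\\f]";
    // Hypothetical alternative: \h is Java's horizontal-whitespace class,
    // which also covers U+00A0 (non-breaking space).
    static final String WIDER_WS = "\\h";

    // Applies step 1 (line ending plus preceding whitespace -> one <br>)
    // and then step 2 (three or more <br> -> two <br>), in that order.
    // quoteReplacement mirrors literalReplacement=true in the config.
    static String run(String content, String ws) {
        String afterStep1 = Pattern.compile(ws + "*\\r?\\n")
                .matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br>"));
        return Pattern.compile("(<br>" + ws + "*){3,}")
                .matcher(afterStep1)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        String ascii = "A \n\n \n\nB";      // only ASCII spaces and newlines
        String nbsp = "A \n\n\u00A0\n\nB";  // assumed: U+00A0 in the gap
        System.out.println(run(ascii, POSTED_WS));
        System.out.println(run(nbsp, POSTED_WS));
        System.out.println(run(nbsp, WIDER_WS));
    }
}
```

In this sketch, input with only ASCII whitespace collapses to 2 <br> with the
posted patterns, while input with a non-breaking space in the gap is left as
<br><br>, the gap character, then <br><br>. Whether that is what happens in
our index would still need to be checked against the raw field value.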

Regards,
Edwin



On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> Thanks for the reply.
>
> For the 2nd pattern, if we use
> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, as in the
> configuration below:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> It is not able to reduce the runs of 3 or more <br> to 2 <br>.
>
> We end up with many <br> in the output, as in the example below:
>
>  http://www.concorded.com/<br><br>  <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> On Tue, Dec 18, 2018
>
>
> Regards,
> Edwin
>
>
>
>
> On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:
>
>> Hi Edwin
>>
>>
>>
>> I can’t understand why the pattern is not working and where the spaces
>> between the <br> are coming from. However, it should be possible to allow
>> for spaces between the <br> in the second match pattern, i.e. 2nd pattern:
>>
>>
>>
>> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>>
>>
>>
>> /Paul
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> Sent: Wednesday, 6 March 2019 16:28
>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi Paul,
>>
>> I have tried with the first match pattern to be
>> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, like the configuration below:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> However, the result is still the same as before (previous index results),
>> with the 4 <br>.
>>
>> Regards,
>> Edwin
>>
>>
>> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
>>
>> > Hi Edwin
>> >
>> >
>> >
>> > You are correct re the 2nd pattern – my bad. Looking at the 4 <br>, it’s
>> > actually the sequence «<br><br>  <br><br>»? So perhaps the first match
>> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >
>> >
>> >
>> > i.e. [space tab vertical-tab formfeed]
>> >
>> >
>> >
>> > Regards,
>> >
>> > Paul
>> >
>> >
>> >
>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> > Sent: Wednesday, 6 March 2019 07:44
>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >
>> >
>> >
>> > Hi Paul,
>> >
>> > I have modified the second pattern to be (&lt;br&gt;){3,} instead of
>> > (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
>> > actually looks for 6 or more <br> rather than 3, because the group
>> > contains <br> twice. That is why there were more <br> in the result:
>> > runs of fewer than 6 <br> were not replaced, so we ended up with up
>> > to 5 <br> in the index.
>> >
>> > Modified configuration:
>> >  <processor class="solr.RegexReplaceProcessorFactory">
>> >    <str name="fieldName">content</str>
>> >    <str name="pattern">(&lt;br&gt;){3,}</str>
>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >    <bool name="literalReplacement">true</bool>
>> >  </processor>
>> >
>> > This will bring us back to the result of the previous index content,
>> > meaning the issue of having the 4 <br> is still there.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> > wrote:
>> >
>> > > Hi Paul,
>> > >
>> > > Further to my previous email, in which there was an extra "}" in the
>> > > configuration, I have changed to the configuration below, based on
>> > > your suggestion.
>> > >
>> > > <processor class="solr.RegexReplaceProcessorFactory">
>> > >    <str name="fieldName">content</str>
>> > >    <str name="pattern">[ \t]*\r?\n</str>
>> > >    <str name="replacement">&lt;br&gt;</str>
>> > >    <bool name="literalReplacement">true</bool>
>> > > </processor>
>> > > <processor class="solr.RegexReplaceProcessorFactory">
>> > >    <str name="fieldName">content</str>
>> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >    <bool name="literalReplacement">true</bool>
>> > > </processor>
>> > >
>> > > However, the result that I get still has more than 2 <br>. In fact, the
>> > > result has become worse, as you can see from the comparison below.
>> > >
>> > > Example 1: The sentence for which the regex pattern used to work
>> > > correctly. With the latest pattern, it has changed from 2 <br> to 5 <br>,
>> > > which is wrong.
>> > > *Original content in EML file:*
>> > > Dear Sir,
>> > >
>> > >
>> > > I am terminating
>> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
>> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>> > >
>> > > Example 2: The sentence for which the above regex pattern is only
>> > > partially working (as you can see, instead of 2 <br>, there are 4 <br>)
>> > > *Original content in EML file:*
>> > >
>> > > *exalted*
>> > >
>> > > *Psalm 89:17*
>> > >
>> > >
>> > > 3 Choa Chu Kang Avenue 4
>> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>> > > Chu Kang Avenue 4, Singapore
>> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
>> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br> <br><br>  3
>> > > Choa Chu Kang Avenue 4, Singapore
>> > >
>> > > Example 3: The sentence for which the above regex pattern is only
>> > > partially working (as you can see, instead of 2 <br>, there are 4 <br>).
>> > > With the latest configuration, there are now 5 <br>.
>> > > *Original content in EML file:*
>> > >
>> > > http://www.concorded.com/
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Tue, Dec 18, 2018 at 10:07 AM
>> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
>> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
>> > > 10:07 AM
>> > > *Previous Index content: *http://www.concorded.com/   <br><br>
>> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > > *Current Index content:* http://www.concorded.com/<br><br> <br><br><br>
>> > > On Tue, Dec 18, 2018 at 10:07 AM
>> > >
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi Paul,
>> > >>
>> > >> Thank you for the reply.
>> > >>
>> > >> I have tried to add the following configuration according to your
>> > >> suggestion:
>> > >>
>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>    <str name="fieldName">content</str>
>> > >>    <str name="pattern">[ \t]*\r?\n}</str>
>> > >>    <str name="replacement">&lt;br&gt;</str>
>> > >>    <bool name="literalReplacement">true</bool>
>> > >> </processor>
>> > >>
>> > >> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>    <str name="fieldName">content</str>
>> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>    <bool name="literalReplacement">true</bool>
>> > >> </processor>
>> > >>
>> > >> However, none of the \n is being removed this time round.
>> > >> Is the order and/or the pattern correct?
>> > >>
>> > >> Regards,
>> > >> Edwin
>> > >>
>> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>> > >>
>> > >>> Hi Edwin
>> > >>>
>> > >>>
>> > >>>
>> > >>> Try for the first pattern/replacement
>> > >>>
>> > >>>
>> > >>>
>> > >>> <str name="pattern">[ \t]*\r?\n</str>
>> > >>>
>> > >>> <str name="replacement">&lt;br&gt;</str>
>> > >>>
>> > >>>
>> > >>>
>> > >>> Now all line endings and preceding whitespace characters should be
>> > >>> changed to ‘<br>’.
>> > >>>
>> > >>>
>> > >>>
>> > >>> The second pattern/replacement should replace 3 or more ‘<br>’ sequences
>> > >>> with 2 ‘<br>’ sequences:
>> > >>>
>> > >>>
>> > >>>
>> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>> > >>>
>> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>
>> > >>>
>> > >>>
>> > >>> Hope this approach works. Sorry for not replying earlier and best
>> > >>> regards,
>> > >>>
>> > >>> Paul
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> > >>> Sent: Tuesday, 5 March 2019 03:35
>> > >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>
>> > >>>
>> > >>>
>> > >>> Hi,
>> > >>>
>> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>> > >>>
>> > >>> Regards,
>> > >>> Edwin
>> > >>>
>> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>> > Hi,
>> > >>> >
>> > >>> > Does anyone else have other suggestions, or has anyone faced the
>> > >>> > same problem?
>> > >>> >
>> > >>> > Regards,
>> > >>> > Edwin
>> > >>> >
>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > >>> > wrote:
>> > >>> >
>> > >>> >> Hi Paul,
>> > >>> >>
>> > >>> >> If I execute the second step first, then I only get a single <br>
>> > >>> >> for the cases that previously had 2 <br>.
>> > >>> >> For the cases where we originally got 4 <br>, there are now 2 <br>
>> > >>> >> with a space in between.
>> > >>> >>
>> > >>> >> This just changes the 2 <br> into a single <br>, since the second
>> > >>> >> step replaces with a single <br>.
>> > >>> >> But it has not solved the underlying problem yet.
>> > >>> >>
>> > >>> >> Regards,
>> > >>> >> Edwin
>> > >>> >>
>> > >>> >>
>> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>> > >>> >>
>> > >>> >>> If the second step is executed first, then you will get the unwanted
>> > >>> >>> 4 <br>
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
>> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>> >>>
>> > >>> >>>
>> > >>> >>>
>> > >>> >>> Hi Jörn ,
>> > >>> >>>
>> > >>> >>> Do you mean the regex is not correct?
>> > >>> >>>
>> > >>> >>> We are already using two RegexReplaceProcessorFactory steps, like
>> > >>> >>> the ones shown below. The output that we get is still the same.
>> > >>> >>>
>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>>      <str name="fieldName">content</str>
>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>> >>>      <bool name="literalReplacement">true</bool>
>> > >>> >>> </processor>
>> > >>> >>>
>> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>>      <str name="fieldName">content</str>
>> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
>> > >>> >>>      <bool name="literalReplacement">true</bool>
>> > >>> >>> </processor>
>> > >>> >>>
>> > >>> >>> Regards,
>> > >>> >>> Edwin
>> > >>> >>>
>> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfranke@gmail.com> wrote:
>> > >>> >>>
>> > >>> >>> > Then you need two regexprocessfactory steps
>> > >>> >>> >
>> > >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>> >>> > >
>> > >>> >>> > > Hi,
>> > >>> >>> > >
>> > >>> >>> > > Thanks for the reply.
>> > >>> >>> > >
>> > >>> >>> > > Do you know of any online regex tool that works correctly for
>> > >>> >>> > > Java regex? I tried to find some, but they are not working properly.
>> > >>> >>> > >
>> > >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
>> > >>> >>> > > single \n with a single <br>.
>> > >>> >>> > >
>> > >>> >>> > > Regards,
>> > >>> >>> > > Edwin
>> > >>> >>> > >
>> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com> wrote:
>> > >>> >>> > >>
>> > >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug - it would
>> > >>> >>> > >> then be in the JDK. Try your solution out in an online regex tool
>> > >>> >>> > >> that supports Java regex.
>> > >>> >>> > >>
>> > >>> >>> > >> I believe you want to have 2 regex process factories:
>> > >>> >>> > >> one that deals with a single \n and one that deals with more than
>> > >>> >>> > >> one \n
>> > >>> >>> > >>
>> > >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>> >>> > >>>
>> > >>> >>> > >>> Hi,
>> > >>> >>> > >>>
>> > >>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>> > >>> >>> > >>> configuration:
>> > >>> >>> > >>>
>> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>> > >>>  <str name="fieldName">content</str>
>> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
>> > >>> >>> > >>> </processor>
>> > >>> >>> > >>>
>> > >>> >>> > >>> However, the issue is still occurring.
>> > >>> >>> > >>>
>> > >>> >>> > >>> Is anyone else able to help?
>> > >>> >>> > >>>
>> > >>> >>> > >>> Regards,
>> > >>> >>> > >>> Edwin
>> > >>> >>> > >>>
>> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > >>> >>> > >>> wrote:
>> > >>> >>> > >>>
>> > >>> >>> > >>>> Hi,
>> > >>> >>> > >>>>
>> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> > >>> >>> > >>>>
>> > >>> >>> > >>>> Regards,
>> > >>> >>> > >>>> Edwin
>> > >>> >>> > >>>>
>> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
>> > >>> >>> > >>>> wrote:
>> > >>> >>> > >>>>> Hi,
>> > >>> >>> > >>>>>
>> > >>> >>> > >>>>> Should we report this as a bug in Solr?
>> > >>> >>> > >>>>>
>> > >>> >>> > >>>>> Regards,
>> > >>> >>> > >>>>> Edwin
>> > >>> >>> > >>>>>
>> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>> > >>> >>> > edwinyeozl@gmail.com
>> > >>> >>> > >>>
>> > >>> >>> > >>>>> wrote:
>> > >>> >>> > >>>>>
>> > >>> >>> > >>>>>> Hi Paul,
>> > >>> >>> > >>>>>>
>> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using: when
>> > >>> >>> > >>>>>> we try it on https://regex101.com/, it gives the correct
>> > >>> >>> > >>>>>> result for all the examples (i.e. all of them have only
>> > >>> >>> > >>>>>> <br><br>, and not more than that, unlike what we are
>> > >>> >>> > >>>>>> getting in Solr in our earlier examples).
>> > >>> >>> > >>>>>>
>> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>> > >>> >>> > >>>>>>
>> > >>> >>> > >>>>>> Regards,
>> > >>> >>> > >>>>>> Edwin
>> > >>> >>> > >>>>>>
>> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>> > >>> >>> > >> edwinyeozl@gmail.com>
>> > >>> >>> > >>>>>> wrote:
>> > >>> >>> > >>>>>>
>> > >>> >>> > >>>>>>> Hi Paul,
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>> > >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, in the following
>> > >>> >>> > >>>>>>> configuration:
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
>> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> > >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>> >>> > >>>>>>> </processor>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> However, we are getting exactly the same results as in
>> > >>> >>> > >>>>>>> the earlier Examples 1, 2 and 3.
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> As for your point 2, that the data may contain other
>> > >>> >>> > >>>>>>> (non-printing) characters than \n: we have found no
>> > >>> >>> > >>>>>>> non-printing characters. It is just a line break with a
>> > >>> >>> > >>>>>>> space. You can refer to the original content in the same
>> > >>> >>> > >>>>>>> examples below.
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern
>> is
>> > >>> working
>> > >>> >>> > >>>>>>> correctly
>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> > >>> >>> > >>>>>>> Dear Sir,
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> I am terminating
>> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> > >>> terminating
>> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> terminating
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern
>> is
>> > >>> >>> partially
>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are
>> 4
>> > >>> <br>)
>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> *exalted*
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> *Psalm 89:17*
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>>  \n\n
>> > >>> >>>  \n\n  3
>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>  <br><br>
>> > >>> >>> <br><br>3
>> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern
>> is
>> > >>> >>> partially
>> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are
>> 4
>> > >>> <br>)
>> > >>> >>> > >>>>>>> *Original content in EML file:*
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> > >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>> > >>>  \n\n
>> > >>> >>> >  \n\n
>> > >>> >>> > >> \n
>> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>> > >>> \n\n\n  On
>> > >>> >>> Tue,
>> > >>> >>> > >> Dec 18,
>> > >>> >>> > >>>>>>> 2018 at 10:07 AM
>> > >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>> > >>>  <br><br>
>> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may
>> > >>> have.
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> Thank you.
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>> Regards,
>> > >>> >>> > >>>>>>> Edwin
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch>
>> > >>> wrote:
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Hi Edwin
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should
>> > >>> >>> > >>>>>>>>     precede the \n, i.e.
>> > >>> >>> > >>>>>>>>     <str name="pattern">(\s*\n){2,}</str>
>> > >>> >>> > >>>>>>>> 2.  Perhaps the data contains other (non-printing)
>> > >>> >>> > >>>>>>>>     characters than \n?
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
>> > >>> >>> > >>>>>>>> for Windows 10
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> > >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> > >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to
>> > >>> >>> > >>>>>>>> detect multiple \n
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Hi Paul,
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow:
>> > >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> > >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>> >>> > >>>>>>>> </processor>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> But we still have exactly the same problem of Example
>> > 1,2
>> > >>> and
>> > >>> >>> 3
>> > >>> >>> > >> below.
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern
>> is
>> > >>> >>> working
>> > >>> >>> > >>>>>>>> correctly
>> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> > >>> >>> terminating
>> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> terminating
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern
>> is
>> > >>> >>> partially
>> > >>> >>> > >>>>>>>> working
>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>> >  \n\n
>> > >>> >>>  \n\n
>> > >>> >>> > 3
>> > >>> >>> > >>>>>>>> Choa
>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>>  <br><br>
>> > >>> >>> > <br><br>3
>> > >>> >>> > >>>>>>>> Choa
>> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern
>> is
>> > >>> >>> partially
>> > >>> >>> > >>>>>>>> working
>> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>> >>> > >>>>>>>> *Original content:*
>> http://www.concordpri.moe.edu.sg/
>> > >>>  \n\n
>> > >>> >>> >  \n\n
>> > >>> >>> > >>>>>>>> \n \n\n
>> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
>> > On
>> > >>> >>> Tue, Dec
>> > >>> >>> > >> 18,
>> > >>> >>> > >>>>>>>> 2018
>> > >>> >>> > >>>>>>>> at 10:07 AM
>> > >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>> > >>>  <br><br>
>> > >>> >>> > >>>>>>>> <br><br>On
>> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Any further suggestion?
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Thank you.
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>> Regards,
>> > >>> >>> > >>>>>>>> Edwin
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.dodd@ub.unibe.ch
>> >
>> > >>> wrote:
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
>> > >>> >>> > >>>>>>>>> failing on the {2,} part, you could try
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
>> > >>> >>> > >>>>>>>>> for Windows 10
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> > >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> > >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to
>> > >>> >>> > >>>>>>>>> detect multiple \n
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Hi Paul,
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Thanks for your reply.
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> When I use this pattern:
>> > >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> > >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>> >>> > >>>>>>>>> </processor>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> It works for some sentences within the same content and
>> > >>> >>> > >>>>>>>>> not for others. Please see below for one that works and
>> > >>> >>> > >>>>>>>>> others that only partially work:
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex
>> pattern is
>> > >>> >>> working
>> > >>> >>> > >>>>>>>> correctly
>> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> > >>> >>> terminating
>> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
>> > terminating
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex
>> pattern is
>> > >>> >>> partially
>> > >>> >>> > >>>>>>>> working
>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> <br>)
>> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>> >  \n\n
>> > >>> >>> >  \n\n  3
>> > >>> >>> > >>>>>>>> Choa
>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>> >  <br><br>
>> > >>> >>> > <br><br>3
>> > >>> >>> > >>>>>>>> Choa
>> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex
>> pattern is
>> > >>> >>> partially
>> > >>> >>> > >>>>>>>> working
>> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4
>> <br>)
>> > >>> >>> > >>>>>>>>> *Original content:*
>> http://www.concordpri.moe.edu.sg/
>> > >>>  \n\n
>> > >>> >>> > >> \n\n
>> > >>> >>> > >>>>>>>> \n
>> > >>> >>> > >>>>>>>>> \n\n
>> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>> \n\n\n
>> > On
>> > >>> >>> Tue,
>> > >>> >>> > Dec
>> > >>> >>> > >>>>>>>> 18, 2018
>> > >>> >>> > >>>>>>>>> at 10:07 AM
>> > >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>> > >>> >>>  <br><br>
>> > >>> >>> > >>>>>>>> <br><br>On
>> > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Thank you.
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>> Regards,
>> > >>> >>> > >>>>>>>>> Edwin
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <
>> paul.dodd@ub.unibe.ch>
>> > >>> wrote:
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working.
>> > >>> >>> > >>>>>>>>>> I assume nothing is replaced? Perhaps the pattern should
>> > >>> >>> > >>>>>>>>>> be
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> ??
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
>> > >>> >>> > >>>>>>>>>> for Windows 10
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com>
>> > >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> > >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect
>> > >>> >>> > >>>>>>>>>> multiple \n
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> Hi,
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
>> > >>> >>> > >>>>>>>>>> replace two or more \n, with any number of spaces between
>> > >>> >>> > >>>>>>>>>> them (e.g. \n\n, \n \n, \n \n  \n \n), with two <br>.
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> I use the following regex pattern, and it works when I
>> > >>> >>> > >>>>>>>>>> test it on regex101.com. But it does not work when I put
>> > >>> >>> > >>>>>>>>>> it inside the RegexReplaceProcessorFactory as below:
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> > >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> > >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>> >>> > >>>>>>>>>> </processor>
>> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> To explain my regex pattern further: \s* tells the regex
>> > >>> >>> > >>>>>>>>>> to match any whitespace that follows a \n, and {2,} tells
>> > >>> >>> > >>>>>>>>>> it to match 2 or more occurrences of that group (\n plus
>> > >>> >>> > >>>>>>>>>> trailing whitespace).
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should
>> > >>> >>> > >>>>>>>>>> do it.
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>> Regards,
>> > >>> >>> > >>>>>>>>>> Edwin
>> > >>> >>> > >>>>>>>>>>
>> > >>> >>> > >>>>>>>>>
>> > >>> >>> > >>>>>>>>
>> > >>> >>> > >>>>>>>
>> > >>> >>> > >>
>> > >>> >>> >
>> > >>> >>>
>> > >>> >>
>> > >>>
>> > >>
>> >
>>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Thanks for the reply.

For the 2nd pattern, we tried <str
name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>, as in the
configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, this does not reduce runs of 3 or more <br> to 2 <br>.

We still end up with many <br> in the output, as in the example below:

 http://www.concorded.com/<br><br>
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
On Tue, Dec 18, 2018
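
One thing that may explain this, offered as a sketch rather than a confirmed
diagnosis: the second pattern contains "]]*", so the extra "]" falls outside
the character class. Java then reads the group as <br>, exactly one whitespace
character, and zero or more literal "]", which can never match a <br> that is
directly followed by another <br>. The small test below runs both the pattern
as posted and an assumed corrected form, (<br>[ \t\x0b\f]*){3,}, through plain
java.util.regex. The class and method names are made up for illustration, and
Solr itself is not involved.

```java
import java.util.regex.Pattern;

final class BrCollapseDemo {
    // Pattern as posted: the second ']' sits outside the character class, so each
    // repetition needs "<br>", exactly one whitespace char, then zero or more ']'.
    static final Pattern AS_POSTED = Pattern.compile("(<br>[ \\t\\x0b\\f]]*){3,}");

    // Assumed intent: "<br>" followed by any amount of horizontal whitespace.
    static final Pattern INTENDED = Pattern.compile("(<br>[ \\t\\x0b\\f]*){3,}");

    // Replaces every match of the given pattern with two <br>.
    static String collapse(Pattern p, String s) {
        return p.matcher(s).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String s = "a<br><br><br><br><br>b";
        System.out.println(collapse(AS_POSTED, s)); // left unchanged
        System.out.println(collapse(INTENDED, s));  // run collapsed to two <br>
    }
}
```

In solrconfig.xml the assumed corrected pattern would be written with the same
escaping as before, i.e. (&lt;br&gt;[ \t\x0b\f]*){3,}.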


Regards,
Edwin




On Thu, 7 Mar 2019 at 20:44, <pa...@ub.unibe.ch> wrote:

> Hi Edwin
>
>
>
> I can’t understand why the pattern is not working or where the spaces
> between the <br> are coming from. However, it should be possible to allow
> for spaces between the <br> in the second match pattern, i.e. 2nd pattern
>
>
>
> <str name="pattern">(&lt;br&gt;[ \t\x0b\f]]*){3,}</str>
>
>
>
> /Paul
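
On where the spaces between the <br> come from: one possibility, which has not
been verified against the actual data, is that the whitespace in the extracted
e-mail text is not a plain space or tab but something like a no-break space
(U+00A0). Neither [ \t\x0b\f] nor Java's default \s matches that character. The
sketch below is one way to check: a hypothetical helper lists every character
outside printable ASCII with its code point, and a second helper shows a
pattern variant built on \h and \R (both available in java.util.regex since
Java 8), which do cover U+00A0 and the usual line breaks.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceProbe {
    // Lists every code point outside printable ASCII (so line breaks, tabs and
    // no-break spaces all show up) as "index:U+XXXX".
    static List<String> unusualChars(String s) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x20 || cp > 0x7E) {
                found.add(String.format("%d:U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    // Collapses two or more line breaks, with any horizontal whitespace around
    // them, into two <br>. Uses \h (horizontal whitespace) and \R (line break).
    static String collapseBlankLines(String s) {
        return s.replaceAll("(\\h*\\R){2,}\\h*", "<br><br>");
    }

    public static void main(String[] args) {
        String sample = "Psalm 89:17 \u00A0\n\n\u00A0 \n\n  3 Choa";
        System.out.println(unusualChars(sample));
        System.out.println(collapseBlankLines(sample));
    }
}
```

If the probe does report something other than U+000A and U+000D in the field
value, that would be worth posting here before changing the pattern.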
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> Sent: Wednesday, 6 March 2019 16:28
> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Paul,
>
> I have tried setting the first match pattern to
> <str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, the result is still the same as before (previous index results),
> with the 4 <br>.
>
> Regards,
> Edwin
>
>
> On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:
>
> > Hi Edwin
> >
> >
> >
> > You are correct  re the 2nd pattern – my bad. Looking at the 4 <br>, it’s
> > actually the sequence «<br><br>  <br><br>»? So perhaps the first match
> > pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
> >
> >
> >
> > i.e. [space tab vertical-tab formfeed]
> >
> >
> >
> > Regards,
> >
> > Paul
> >
> >
> >
> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > Windows 10
> >
> >
> >
> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > Sent: Wednesday, 6 March 2019 07:44
> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >
> >
> >
> > Hi Paul,
> >
> > I have modified the second pattern to be (&lt;br&gt;){3,}, instead of
> > (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,}
> > actually looks for 6 or more <br> instead of 3, because <br> appears
> > twice inside the repeated group. That is why there were more <br> in the
> > result: runs of fewer than 6 <br> were not replaced, so we ended up with
> > up to 5 <br> in the index.
> >
> > Modified configuration:
> >  <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">(&lt;br&gt;){3,}</str>
> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> >  </processor>
> >
> > This will bring us back to the result of the previous index content,
> > meaning the issue of having the 4 <br> is still there.
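
The counting argument above can be checked outside Solr. A minimal sketch in
plain Java (the class and helper names are only for illustration): a quantifier
applies to the whole group, so (<br><br>){3,} needs three copies of the
two-<br> group, i.e. at least 6 <br> in a row, while (<br>){3,} needs only 3.

```java
import java.util.Collections;
import java.util.regex.Pattern;

final class GroupRepeatDemo {
    // Builds a run of n consecutive "<br>" with nothing in between.
    static String run(int n) {
        return String.join("", Collections.nCopies(n, "<br>"));
    }

    // True if the regex finds a match anywhere in a run of n "<br>".
    static boolean findsIn(String regex, int n) {
        return Pattern.compile(regex).matcher(run(n)).find();
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 6; n++) {
            System.out.println(n + " <br>: pairs=" + findsIn("(<br><br>){3,}", n)
                    + " singles=" + findsIn("(<br>){3,}", n));
        }
    }
}
```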
> >
> > Regards,
> > Edwin
> >
> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> > > Hi Paul,
> > >
> > > Further to my previous email, in which there was an extra "}" in the
> > > configuration, I have changed to the configuration below, based on your
> > > suggestion.
> > >
> > > <processor class="solr.RegexReplaceProcessorFactory">
> > >    <str name="fieldName">content</str>
> > >    <str name="pattern">[ \t]*\r?\n</str>
> > >    <str name="replacement">&lt;br&gt;</str>
> > >    <bool name="literalReplacement">true</bool>
> > > </processor>
> > > <processor class="solr.RegexReplaceProcessorFactory">
> > >    <str name="fieldName">content</str>
> > >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >    <bool name="literalReplacement">true</bool>
> > > </processor>
> > >
> > > However, the result that I get still has more than 2 <br>. In fact, the
> > > result became worse, as you can see from the comparison below.
> > >
> > > Example 1: The sentence for which the regex pattern used to work
> > > correctly. With the latest pattern, it has changed from 2 <br> to 5 <br>,
> > > which is wrong.
> > > *Original content in EML file:*
> > > Dear Sir,
> > >
> > >
> > > I am terminating
> > > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> > > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
> > >
> > > Example 2: The sentence for which the above regex pattern is partially
> > > working (as you can see, instead of 2 <br>, there are 4 <br>)
> > > *Original content in EML file:*
> > >
> > > *exalted*
> > >
> > > *Psalm 89:17*
> > >
> > >
> > > 3 Choa Chu Kang Avenue 4
> > > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> Choa
> > > Chu Kang Avenue 4, Singapore
> > > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> > > <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br> 3
> > > Choa Chu Kang Avenue 4, Singapore
> > >
> > > Example 3: The sentence for which the above regex pattern is partially
> > > working (as you can see, instead of 2 <br>, there are 4 <br>). With the
> > > latest configuration, there are now 5 <br>.
> > > *Original content in EML file:*
> > >
> > > http://www.concorded.com/
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Dec 18, 2018 at 10:07 AM
> > > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n
> \n\n
> > > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
> at
> > > 10:07 AM
> > > *Previous Index content: *http://www.concorded.com/   <br><br>
> > > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > > *Current Index content:* http://www.concorded.com/<br><br>
> <br><br><br>
> > > On Tue, Dec 18, 2018 at 10:07 AM
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >
> > > wrote:
> > >
> > >> Hi Paul,
> > >>
> > >> Thank you for the reply.
> > >>
> > >> I have tried to add the following configuration according to your
> > >> suggestion:
> > >>
> > >> <processor class="solr.RegexReplaceProcessorFactory">
> > >>    <str name="fieldName">content</str>
> > >>    <str name="pattern">[ \t]*\r?\n}</str>
> > >>    <str name="replacement">&lt;br&gt;</str>
> > >>    <bool name="literalReplacement">true</bool>
> > >> </processor>
> > >>
> > >> <processor class="solr.RegexReplaceProcessorFactory">
> > >>    <str name="fieldName">content</str>
> > >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> > >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>    <bool name="literalReplacement">true</bool>
> > >> </processor>
> > >>
> > >> However, none of the \n is being removed this time round.
> > >> Is the order and/or the pattern correct?
> > >>
> > >> Regards,
> > >> Edwin
> > >>
> > >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
> > >>
> > >>> Hi Edwin
> > >>>
> > >>>
> > >>>
> > >>> Try for the first pattern/replacement
> > >>>
> > >>>
> > >>>
> > >>> <str name="pattern">[ \t]*\r?\n</str>
> > >>>
> > >>> <str name="replacement">&lt;br&gt;</str>
> > >>>
> > >>>
> > >>>
> > >>> Now all line endings and preceding whitespace characters should be
> > >>> changed to ‘<br>’.
> > >>>
> > >>>
> > >>>
> > >>> The second pattern replacement should replace 3 or more ‘<br>’
> > sequences
> > >>> to 2 ‘<br>’ sequences:
> > >>>
> > >>>
> > >>>
> > >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> > >>>
> > >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>
> > >>>
> > >>>
> > >>> Hope this approach works. Sorry for not replying earlier and best
> > >>> regards,
> > >>>
> > >>> Paul
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>> Windows 10
> > >>>
> > >>>
> > >>>
> > >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>> Sent: Tuesday, 5 March 2019 03:35
> > >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> > >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>
> > >>>
> > >>>
> > >>> Hi,
> > >>>
> > >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <
> > edwinyeozl@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> > Does anyone else have other suggestions, or has anyone faced the same problem?
> > >>> >
> > >>> > Regards,
> > >>> > Edwin
> > >>> >
> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <
> > >>> edwinyeozl@gmail.com>
> > >>> > wrote:
> > >>> >
> > >>> >> Hi Paul,
> > >>> >>
> > >>> >> If I execute the second step first, then I only get a single <br>
> > >>> >> for those with 2 <br>. For those where we originally got 4 <br>,
> > >>> >> there are now 2 <br> with a space in between.
> > >>> >>
> > >>> >> This just changes the 2 <br> to a single <br>, since the second
> > >>> >> step replaces with a single <br>. It has not solved the underlying
> > >>> >> problem yet.
> > >>> >>
> > >>> >> Regards,
> > >>> >> Edwin
> > >>> >>
> > >>> >>
> > >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
> > >>> >>
> > >>> >>> If the second step is executed first, then you will get the
> > >>> >>> unwanted 4 <br>
> > >>> >>>
> > >>> >>>
> > >>> >>>
> > >>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > >>> >>> Windows 10
> > >>> >>>
> > >>> >>>
> > >>> >>>
> > >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>> >>> Sent: Wednesday, 20 February 2019 09:29
> > >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
> > >>> >>> multiple \n
> > >>> >>>
> > >>> >>>
> > >>> >>>
> > >>> >>> Hi Jörn ,
> > >>> >>>
> > >>> >>> Do you mean the regex is not correct?
> > >>> >>>
> > >>> >>> We are already using two RegexReplaceProcessorFactory steps, like
> > >>> >>> the ones shown below. The output that we get is still the same.
> > >>> >>>
> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>> >>>      <str name="fieldName">content</str>
> > >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>> >>>      <bool name="literalReplacement">true</bool>
> > >>> >>> </processor>
> > >>> >>>
> > >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>> >>>      <str name="fieldName">content</str>
> > >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
> > >>> >>>      <str name="replacement">&lt;br&gt;</str>
> > >>> >>>      <bool name="literalReplacement">true</bool>
> > >>> >>> </processor>
> > >>> >>>
> > >>> >>> Regards,
> > >>> >>> Edwin
> > >>> >>>
> > >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com>
> > >>> wrote:
> > >>> >>>
> > >>> >>> > Then you need two regexprocessfactory steps
> > >>> >>> >
> > >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo
> > >>> >>> > > <edwinyeozl@gmail.com> wrote:
> > >>> >>> > >
> > >>> >>> > > Hi,
> > >>> >>> > >
> > >>> >>> > > Thanks for the reply.
> > >>> >>> > >
> > >>> >>> > > Do you know of any online regex tool that works correctly for
> > >>> >>> > > Java regex? I tried to find some, but they are not working
> > >>> >>> > > properly.
> > >>> >>> > >
> > >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and
> > >>> >>> > > a single \n with a single <br>.
> > >>> >>> > >
> > >>> >>> > > Regards,
> > >>> >>> > > Edwin
> > >>> >>> > >
> > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke
> > >>> >>> > >> <jornfranke@gmail.com> wrote:
> > >>> >>> > >>
> > >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug;
> > >>> >>> > >> it would then be in the JDK. Try your solution in an online
> > >>> >>> > >> regex tool that supports Java regex.
> > >>> >>> > >>
> > >>> >>> > >> I believe you want to have 2 regex process factories:
> > >>> >>> > >> One that deals with a single \n and one that deals with
> > >>> >>> > >> more than one \n
> > >>> >>> > >>
> > >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo
> > >>> >>> > >>> <edwinyeozl@gmail.com> wrote:
> > >>> >>> > >>>
> > >>> >>> > >>> Hi,
> > >>> >>> > >>>
> > >>> >>> > >>> We have tried the pattern ([ \t]*\r?\n){2,} with the
> > >>> >>> > >>> following configuration:
> > >>> >>> > >>>
> > >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>> >>> > >>>  <str name="fieldName">content</str>
> > >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>> >>> > >>>  <bool name="literalReplacement">true</bool>
> > >>> >>> > >>> </processor>
> > >>> >>> > >>>
> > >>> >>> > >>> However, the issue is still occurring.
> > >>> >>> > >>>
> > >>> >>> > >>> Is anyone else able to help?
> > >>> >>> > >>>
> > >>> >>> > >>> Regards,
> > >>> >>> > >>> Edwin
> > >>> >>> > >>>
> > >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> > >>> >>> > edwinyeozl@gmail.com>
> > >>> >>> > >>> wrote:
> > >>> >>> > >>>
> > >>> >>> > >>>> Hi,
> > >>> >>> > >>>>
> > >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as
> > well.
> > >>> >>> > >>>>
> > >>> >>> > >>>> Regards,
> > >>> >>> > >>>> Edwin
> > >>> >>> > >>>>
> > >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> > >>> >>> > edwinyeozl@gmail.com
> > >>> >>> > >>>
> > >>> >>> > >>>> wrote:
> > >>> >>> > >>>>
> > >>> >>> > >>>>> Hi,
> > >>> >>> > >>>>>
> > >>> >>> > >>>>> Should we report this as a bug in Solr?
> > >>> >>> > >>>>>
> > >>> >>> > >>>>> Regards,
> > >>> >>> > >>>>> Edwin
> > >>> >>> > >>>>>
> > >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> > >>> >>> > edwinyeozl@gmail.com
> > >>> >>> > >>>
> > >>> >>> > >>>>> wrote:
> > >>> >>> > >>>>>
> > >>> >>> > >>>>>> Hi Paul,
> > >>> >>> > >>>>>>
> > >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when
> we
> > >>> try
> > >>> >>> in on
> > >>> >>> > >>>>>> https://regex101.com/, it is able to give us the
> correct
> > >>> >>> result for
> > >>> >>> > >> all
> > >>> >>> > >>>>>> the examples (ie: All of them will only have <br><br>,
> and
> > >>> not
> > >>> >>> more
> > >>> >>> > >> than
> > >>> >>> > >>>>>> that like what we are getting in Solr in our earlier
> > >>> examples).
> > >>> >>> > >>>>>>
> > >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
> > >>> >>> > >>>>>>
> > >>> >>> > >>>>>> Regards,
> > >>> >>> > >>>>>> Edwin
> > >>> >>> > >>>>>>
> > >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> > >>> >>> > >> edwinyeozl@gmail.com>
> > >>> >>> > >>>>>> wrote:
> > >>> >>> > >>>>>>
> > >>> >>> > >>>>>>> Hi Paul,
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> We have tried it with the space preceeding the \n i.e.
> > <str
> > >>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following
> > regex
> > >>> >>> pattern:
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>> >>> > >>>>>>>  <str name="fieldName">content</str>
> > >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> > >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>> >>> > >>>>>>> </processor>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> However, we are also getting the exact same results as
> > the
> > >>> >>> earlier
> > >>> >>> > >>>>>>> Example 1, 2 and 3.
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> As for your point 2 on perhaps in the data you have
> other
> > >>> (non
> > >>> >>> > >>>>>>> printing) characters than \n, we have find that there
> are
> > >>> no
> > >>> >>> non
> > >>> >>> > >> printing
> > >>> >>> > >>>>>>> characters. It is just next line with a space. You can
> > >>> refer
> > >>> >>> to the
> > >>> >>> > >>>>>>> original content in the same examples below.
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is
> > >>> working
> > >>> >>> > >>>>>>> correctly
> > >>> >>> > >>>>>>> *Original content in EML file:*
> > >>> >>> > >>>>>>> Dear Sir,
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> I am terminating
> > >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> > >>> terminating
> > >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> terminating
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is
> > >>> >>> partially
> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
> > >>> <br>)
> > >>> >>> > >>>>>>> *Original content in EML file:*
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> *exalted*
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> *Psalm 89:17*
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
> > >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>  \n\n
> > >>> >>>  \n\n  3
> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>  <br><br>
> > >>> >>> <br><br>3
> > >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is
> > >>> >>> partially
> > >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
> > >>> <br>)
> > >>> >>> > >>>>>>> *Original content in EML file:*
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> > >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
> > >>>  \n\n
> > >>> >>> >  \n\n
> > >>> >>> > >> \n
> > >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
> > >>> \n\n\n  On
> > >>> >>> Tue,
> > >>> >>> > >> Dec 18,
> > >>> >>> > >>>>>>> 2018 at 10:07 AM
> > >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
> > >>>  <br><br>
> > >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may
> > >>> have.
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> Thank you.
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>> Regards,
> > >>> >>> > >>>>>>> Edwin
> > >>> >>> > >>>>>>>
> > >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch>
> > >>> wrote:
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Hi Edwin
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should
> > preceed
> > >>> >>> the \n
> > >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> > >>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
> > >>> >>> characters
> > >>> >>> > >>>>>>>> than \n?
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Gesendet von Mail<
> > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986>
> > >>> >>> > >> für
> > >>> >>> > >>>>>>>> Windows 10
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Von: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>> >>> > >>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:23
> > >>> >>> > >>>>>>>> An: solr-user@lucene.apache.org<mailto:
> > >>> >>> > solr-user@lucene.apache.org>
> > >>> >>> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to
> > >>> detect
> > >>> >>> > >> multiple \n
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Hi Paul,
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow:
> > >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
> > >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> > >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>> >>> > >>>>>>>> </processor>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> But we still have exactly the same problem of Example
> > 1,2
> > >>> and
> > >>> >>> 3
> > >>> >>> > >> below.
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern
> is
> > >>> >>> working
> > >>> >>> > >>>>>>>> correctly
> > >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> > >>> >>> terminating
> > >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> terminating
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern
> is
> > >>> >>> partially
> > >>> >>> > >>>>>>>> working
> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
> >  \n\n
> > >>> >>>  \n\n
> > >>> >>> > 3
> > >>> >>> > >>>>>>>> Choa
> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> > >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>  <br><br>
> > >>> >>> > <br><br>3
> > >>> >>> > >>>>>>>> Choa
> > >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern
> is
> > >>> >>> partially
> > >>> >>> > >>>>>>>> working
> > >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
> > >>>  \n\n
> > >>> >>> >  \n\n
> > >>> >>> > >>>>>>>> \n \n\n
> > >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> > On
> > >>> >>> Tue, Dec
> > >>> >>> > >> 18,
> > >>> >>> > >>>>>>>> 2018
> > >>> >>> > >>>>>>>> at 10:07 AM
> > >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
> > >>>  <br><br>
> > >>> >>> > >>>>>>>> <br><br>On
> > >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Any further suggestion?
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Thank you.
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>> Regards,
> > >>> >>> > >>>>>>>> Edwin
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch>
> > >>> wrote:
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
> > >>> failing
> > >>> >>> on
> > >>> >>> > the
> > >>> >>> > >>>>>>>> {2,}
> > >>> >>> > >>>>>>>>> part you could try
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> If you also want to match CRLF then
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Gesendet von Mail<
> > >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986
> > >>> >>> > >
> > >>> >>> > >>>>>>>> für
> > >>> >>> > >>>>>>>>> Windows 10
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinyeozl@gmail.com
> >
> > >>> >>> > >>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:10
> > >>> >>> > >>>>>>>>> An: solr-user@lucene.apache.org<mailto:
> > >>> >>> > solr-user@lucene.apache.org
> > >>> >>> > >>>
> > >>> >>> > >>>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to
> > >>> detect
> > >>> >>> > >> multiple
> > >>> >>> > >>>>>>>> \n
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Hi Paul,
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Thanks for your reply.
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> When I use this pattern:
> > >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
> > >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> > >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>> >>> > >>>>>>>>> </processor>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> It is working for some sentence within the same
> content
> > >>> and
> > >>> >>> not
> > >>> >>> > >>>>>>>> working for
> > >>> >>> > >>>>>>>>> some sentences. Please see below for the one that is
> > >>> working
> > >>> >>> and
> > >>> >>> > >>>>>>>> another
> > >>> >>> > >>>>>>>>> that is not working (partially working):
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern
> is
> > >>> >>> working
> > >>> >>> > >>>>>>>> correctly
> > >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> > >>> >>> terminating
> > >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> > terminating
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern
> is
> > >>> >>> partially
> > >>> >>> > >>>>>>>> working
> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
> >  \n\n
> > >>> >>> >  \n\n  3
> > >>> >>> > >>>>>>>> Choa
> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> > >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
> >  <br><br>
> > >>> >>> > <br><br>3
> > >>> >>> > >>>>>>>> Choa
> > >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern
> is
> > >>> >>> partially
> > >>> >>> > >>>>>>>> working
> > >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>> >>> > >>>>>>>>> *Original content:*
> http://www.concordpri.moe.edu.sg/
> > >>>  \n\n
> > >>> >>> > >> \n\n
> > >>> >>> > >>>>>>>> \n
> > >>> >>> > >>>>>>>>> \n\n
> > >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> > On
> > >>> >>> Tue,
> > >>> >>> > Dec
> > >>> >>> > >>>>>>>> 18, 2018
> > >>> >>> > >>>>>>>>> at 10:07 AM
> > >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
> > >>> >>>  <br><br>
> > >>> >>> > >>>>>>>> <br><br>On
> > >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Thank you.
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>> Regards,
> > >>> >>> > >>>>>>>>> Edwin
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.dodd@ub.unibe.ch
> >
> > >>> wrote:
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not
> > >>> working. I
> > >>> >>> > assume
> > >>> >>> > >>>>>>>> nothing
> > >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> ??
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> Gesendet von Mail<
> > >>> >>> > https://go.microsoft.com/fwlink/?LinkId=550986>
> > >>> >>> > >>>>>>>> für
> > >>> >>> > >>>>>>>>>> Windows 10
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:
> edwinyeozl@gmail.com>
> > >>> >>> > >>>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 14:08
> > >>> >>> > >>>>>>>>>> An: solr-user@lucene.apache.org<mailto:
> > >>> >>> > >> solr-user@lucene.apache.org
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>>> Betreff: RegexReplaceProcessorFactory pattern to
> > detect
> > >>> >>> multiple
> > >>> >>> > >> \n
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> Hi,
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory
> to
> > >>> >>> remove
> > >>> >>> > more
> > >>> >>> > >>>>>>>> than
> > >>> >>> > >>>>>>>>> two
> > >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n,
> > \n
> > >>> \n,
> > >>> >>> \n
> > >>> >>> > \n
> > >>> >>> > >>>>>>>> \n
> > >>> >>> > >>>>>>>>> \n),
> > >>> >>> > >>>>>>>>>> and replace it with two <br>.
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> I use the following regex pattern and it is working
> > >>> when I
> > >>> >>> test
> > >>> >>> > it
> > >>> >>> > >>>>>>>> in
> > >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it
> > >>> inside
> > >>> >>> the
> > >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> > >>> >>> > >>>>>>>>>> <processor
> class="solr.RegexReplaceProcessorFactory">
> > >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
> > >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> > >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>> >>> > >>>>>>>>>> </processor>
> > >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
> > >>> >>> instructing
> > >>> >>> > the
> > >>> >>> > >>>>>>>> regex
> > >>> >>> > >>>>>>>>> to
> > >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
> > >>> instructing
> > >>> >>> the
> > >>> >>> > >>>>>>>> regex to
> > >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how
> should
> > >>> I do
> > >>> >>> it?
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>> Regards,
> > >>> >>> > >>>>>>>>>> Edwin
> > >>> >>> > >>>>>>>>>>
> > >>> >>> > >>>>>>>>>
> > >>> >>> > >>>>>>>>
> > >>> >>> > >>>>>>>
> > >>> >>> > >>
> > >>> >>> >
> > >>> >>>
> > >>> >>
> > >>>
> > >>
> >
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
Hi Edwin



I can’t understand why the pattern is not working, or where the spaces between the <br> are coming from. However, it should be possible to allow for spaces between the <br> in the second match pattern, i.e. as the 2nd pattern:



<str name="pattern">(&lt;br&gt;[ \t\x0b\f]*){3,}</str>
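
To see what such a second pattern does outside Solr, here is a minimal Java sketch (Solr's regex processing relies on java.util.regex). The class and method names are made up for illustration, the character class is closed with a single `]`, and the input is the Example 2 index output quoted in this thread, assuming the gaps between the <br> are plain ASCII spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrCollapseSketch {
    // Hypothetical helper, not part of Solr: matches 3 or more <br>, each
    // optionally followed by space, tab, vertical tab or form feed.
    private static final Pattern RUN_OF_BR =
            Pattern.compile("(<br>[ \\t\\x0b\\f]*){3,}");

    static String collapse(String text) {
        // quoteReplacement keeps the replacement string literal
        // (no $1-style group references are interpreted).
        return RUN_OF_BR.matcher(text)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        String indexed = "Psalm 89:17<br><br>  <br><br>  3 Choa Chu Kang Avenue 4";
        // Prints: Psalm 89:17<br><br>3 Choa Chu Kang Avenue 4
        System.out.println(collapse(indexed));
    }
}
```

Note that the whitespace directly after the last <br> of a run is swallowed as well, and that a run of only two <br> is left untouched.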



/Paul



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Wednesday, 6 March 2019 16:28
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi Paul,

I have tried setting the first match pattern to
<str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
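
As a control experiment, the two steps above can be replayed in plain Java. This is only a sketch: the class name is hypothetical, it assumes each processor boils down to a java.util.regex replace-all with a literal replacement, and it assumes the content contains nothing but ASCII spaces and \n, as transcribed in Examples 1 and 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepSketch {
    // Step 1: optional horizontal whitespace followed by a line ending.
    static final Pattern LINE_END = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");
    // Step 2: three or more directly adjacent <br>.
    static final Pattern THREE_OR_MORE_BR = Pattern.compile("(<br>){3,}");

    static String apply(String content) {
        String step1 = LINE_END.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br>"));
        return THREE_OR_MORE_BR.matcher(step1)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n"
                + "  3 Choa Chu Kang Avenue 4, Singapore";
        // Prints: exalted<br><br>   Psalm 89:17<br><br>  3 Choa Chu Kang Avenue 4, Singapore
        System.out.println(apply(example2));
    }
}
```

On this ASCII-only input no run of more than two <br> survives, so a different result in the index would suggest that the real field content differs from the transcription in some way.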

However, the result is still the same as before (previous index results),
with the 4 <br>.
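
One way to follow up on the earlier question about non-printing characters is to dump the code points of everything invisible in the raw field value. The sketch below is an illustration only: the class and method names are invented, and the non-breaking space (U+00A0) in the sample is an assumption about what such data might contain, not something confirmed in this thread. For reference, neither `[ \t\x0b\f]` nor Java's default `\s` matches U+00A0.

```java
class WhitespaceDumpSketch {
    // Hypothetical diagnostic: lists the code point of every whitespace,
    // space-separator or control character found in the text.
    static String describeInvisible(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.isWhitespace(cp)
                    || Character.isSpaceChar(cp)
                    || Character.isISOControl(cp)) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(String.format("U+%04X", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample value with an assumed non-breaking space before the line breaks.
        String sample = "17 \u00A0\n\n3";
        // Prints: U+0020 U+00A0 U+000A U+000A
        System.out.println(describeInvisible(sample));
    }
}
```

Running something like this on the value actually sent to Solr would show whether the gaps consist only of U+0020 and U+000A.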

Regards,
Edwin


On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:

> Hi Edwin
>
>
>
> You are correct re the 2nd pattern, my bad. Looking at the 4 <br>, is it
> actually the sequence «<br><br>  <br><br>»? So perhaps the first match
> pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>
>
>
> i.e. [space tab vertical-tab formfeed]
>
>
>
> Regards,
>
> Paul
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> Sent: Wednesday, 6 March 2019 07:44
> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Paul,
>
> I have modified the second pattern to be (&lt;br&gt;){3,} instead of
> (&lt;br&gt;&lt;br&gt;){3,}. Because the group contains <br> twice, the
> pattern (&lt;br&gt;&lt;br&gt;){3,} looks for 6 or more <br> rather than 3.
> That is why there were more <br> in the result: runs of fewer than 6 <br>
> were not replaced, so we ended up with up to 5 <br> in the index.
>
> Modified configuration:
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
>  </processor>
>
> This will bring us back to the result of the previous index content,
> meaning the issue of having the 4 <br> is still there.
>
> Regards,
> Edwin
>
> On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > Hi Paul,
> >
> > Further to my previous email, in which there was an extra "}" in the
> > configuration, I have changed to the configuration below, based on your
> > suggestion.
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">[ \t]*\r?\n</str>
> >    <str name="replacement">&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the result that I get still has more than 2 <br>. In fact, the
> > result became worse, as you can see from the comparison below.
> >
> > Example 1: The sentence where the regex pattern used to work correctly.
> > With the latest pattern, it has changed from 2 <br> to 5 <br>, which is
> > wrong.
> > *Original content in EML file:*
> > Dear Sir,
> >
> >
> > I am terminating
> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
> >
> > Example 2: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content in EML file:*
> >
> > *exalted*
> >
> > *Psalm 89:17*
> >
> >
> > 3 Choa Chu Kang Avenue 4
> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > Chu Kang Avenue 4, Singapore
> > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> > <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
> > Choa Chu Kang Avenue 4, Singapore
> >
> > Example 3: The sentence where the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>). With the latest
> > code, there are now 5 <br>.
> > *Original content in EML file:*
> >
> > http://www.concorded.com/
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2018 at 10:07 AM
> > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> > 10:07 AM
> > *Previous Index content: *http://www.concorded.com/   <br><br>
> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > *Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
> > On Tue, Dec 18, 2018 at 10:07 AM
> >
> >
> > Regards,
> > Edwin
> >
> > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> >> Hi Paul,
> >>
> >> Thank you for the reply.
> >>
> >> I have tried to add the following configuration according to your
> >> suggestion:
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">[ \t]*\r?\n}</str>
> >>    <str name="replacement">&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> However, none of the \n is being removed this time round.
> >> Is the order and/or the pattern correct?
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
> >>
> >>> Hi Edwin
> >>>
> >>>
> >>>
> >>> Try for the first pattern/replacement
> >>>
> >>>
> >>>
> >>> <str name="pattern">[ \t]*\r?\n</str>
> >>>
> >>> <str name="replacement">&lt;br&gt;</str>
> >>>
> >>>
> >>>
> >>> Now all line endings and preceding whitespace characters should be
> >>> changed to ‘<br>’.
> >>>
> >>>
> >>>
> >>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
> >>> with 2 ‘<br>’ sequences:
> >>>
> >>>
> >>>
> >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>>
> >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>
> >>>
> >>>
> >>> Hope this approach works. Sorry for not replying earlier and best
> >>> regards,
> >>>
> >>> Paul
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>> Windows 10
> >>>
> >>>
> >>>
> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> Sent: Tuesday, 5 March 2019 03:35
> >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>
> >>>
> >>>
> >>> Hi,
> >>>
> >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Does anyone else have other suggestions, or has anyone faced the same
> >>> > problem?
> >>> >
> >>> > Regards,
> >>> > Edwin
> >>> >
> >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi Paul,
> >>> >>
> >>> >> If I execute the second step first, then I only get a
> >>> >> single <br> for those with 2 <br>.
> >>> >> For those where we originally got 4 <br>, there will be 2 <br> with a
> >>> >> space in between.
> >>> >>
> >>> >> This just changes the 2 <br> to a single <br>, since the second
> >>> >> step replaces with a single <br>.
> >>> >> But it has not solved the underlying problem yet.
> >>> >>
> >>> >> Regards,
> >>> >> Edwin
> >>> >>
> >>> >>
> >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
> >>> >>
> >>> >>> If the second step is executed first, then you will get the unwanted 4
> >>> >>> <br>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>> >>> Windows 10
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> >>> Sent: Wednesday, 20 February 2019 09:29
> >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> Hi Jörn ,
> >>> >>>
> >>> >>> Do you mean the regex is not correct?
> >>> >>>
> >>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
> >>> >>> ones shown below. The output that we get is still the same.
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>      <str name="fieldName">content</str>
> >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>>      <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>      <str name="fieldName">content</str>
> >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
> >>> >>>      <str name="replacement">&lt;br&gt;</str>
> >>> >>>      <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> Regards,
> >>> >>> Edwin
> >>> >>>
> >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com>
> >>> >>> wrote:
> >>> >>>
> >>> >>> > Then you need two RegexReplaceProcessorFactory steps
> >>> >>> >
> >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > > wrote:
> >>> >>> > >
> >>> >>> > > Hi,
> >>> >>> > >
> >>> >>> > > Thanks for the reply.
> >>> >>> > >
> >>> >>> > > Do you know of any online regex tool that works correctly for
> >>> >>> > > Java regex?
> >>> >>> > > I tried to find some, but they are not working properly.
> >>> >>> > >
> >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and
> >>> >>> > > a single \n with a single <br>.
> >>> >>> > >
> >>> >>> > > Regards,
> >>> >>> > > Edwin
> >>> >>> > >
> >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com>
> >>> >>> > >> wrote:
> >>> >>> > >>
> >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would
> >>> >>> > >> then be in the JDK. Try out your solution in an online regex tool
> >>> >>> > >> that supports Java regex.
> >>> >>> > >>
> >>> >>> > >> I believe you want to have 2 regex processor factories:
> >>> >>> > >> one that deals with a single \n and one that deals with more than
> >>> >>> > >> one \n
> >>> >>> > >>
> >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > >>> wrote:
> >>> >>> > >>>
> >>> >>> > >>> Hi,
> >>> >>> > >>>
> >>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, and
> >>> >>> > >>> configuration:
> >>> >>> > >>>
> >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>  <str name="fieldName">content</str>
> >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>  <bool name="literalReplacement">true</bool>
> >>> >>> > >>> </processor>
> >>> >>> > >>>
> >>> >>> > >>> However, the issue is still occurring.
> >>> >>> > >>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

I have tried setting the first match pattern to
<str name="pattern">[ \t\x0b\f]*\r?\n</str>, as in the configuration below:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t\x0b\f]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
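
As a cross-check outside Solr, the two patterns can be run through
java.util.regex directly. This is only a minimal sketch: the class and method
names are made up for illustration, and it assumes the field contains nothing
but ordinary spaces and \n. Under that assumption the chain collapses the run
in Example 2 to exactly two <br>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepReplaceSketch {
    // Same patterns as in the two <processor> blocks (XML entities decoded).
    private static final Pattern LINE_END = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");
    private static final Pattern BR_RUN = Pattern.compile("(<br>){3,}");

    // Hypothetical helper: applies step 1, then step 2, in chain order.
    static String collapse(String content) {
        // quoteReplacement treats the replacement as a literal string,
        // which is the intent of literalReplacement=true in the config.
        String step1 = LINE_END.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br>"));
        return BR_RUN.matcher(step1)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 from this thread, assuming only plain spaces and \n.
        String in = "Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
        System.out.println(collapse(in));
    }
}
```

If the real field content behaves differently from this sketch, the input
probably contains something other than plain spaces and \n.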

However, the result is still the same as the previous index results: the
4 <br> are still there.
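
One possible explanation, not confirmed anywhere in this thread, is a
whitespace-like character that neither \s nor [ \t\x0b\f] matches, for example
a no-break space (U+00A0), which is common in text extracted from HTML email.
The sketch below is a guess under that assumption. The class and method names
are hypothetical. It shows that such a character reproduces the
"<br><br> <br><br>" symptom with the default ASCII-only \s, and it includes a
small helper that makes hidden characters visible.

```java
import java.util.regex.Pattern;

public class NbspCheckSketch {
    private static final String REGEX = "(\\n\\s*){2,}";

    // Default Java \s is ASCII-only: [ \t\n\x0B\f\r].
    static String replaceAscii(String s) {
        return Pattern.compile(REGEX).matcher(s).replaceAll("<br><br>");
    }

    // With UNICODE_CHARACTER_CLASS, \s follows the Unicode White_Space
    // property, which includes U+00A0.
    static String replaceUnicode(String s) {
        return Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).replaceAll("<br><br>");
    }

    // Shows spaces, control characters and non-ASCII characters as [U+XXXX].
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 0x20 && cp < 0x7f) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("[U+%04X]", cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: a no-break space sits between the two pairs of newlines.
        String in = "Psalm 89:17\n\n\u00A0\n\n3 Choa";
        System.out.println(reveal(replaceAscii(in)));
        System.out.println(reveal(replaceUnicode(in)));
    }
}
```

If dumping the real field content with something like reveal() does show
U+00A0 or a similar character, the same flag can be set inline in the pattern
as (?U), e.g. (?U)(\n\s*){2,}. Whether that is the actual cause here is still
an open question.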

Regards,
Edwin


On Wed, 6 Mar 2019 at 18:23, <pa...@ub.unibe.ch> wrote:

> Hi Edwin
>
>
>
> You are correct regarding the second pattern, my mistake. Looking at the
> 4 <br>, is it actually the sequence «<br><br>  <br><br>»? If so, perhaps the
> first match pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>
>
>
> i.e. [space tab vertical-tab formfeed]
>
>
>
> Regards,
>
> Paul
>
>
> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> Sent: Wednesday, 6 March 2019 07:44
> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Paul,
>
> I have modified the second pattern to (&lt;br&gt;){3,} instead of
> (&lt;br&gt;&lt;br&gt;){3,}. The pattern (&lt;br&gt;&lt;br&gt;){3,} actually
> looks for 6 or more <br> rather than 3, because <br> appears twice inside
> the group. Runs of fewer than 6 <br> were therefore not replaced, which is
> why we ended up with up to 5 <br> in the index.
>
> Modified configuration:
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
>  </processor>
>
> This will bring us back to the result of the previous index content,
> meaning the issue of having the 4 <br> is still there.
>
> Regards,
> Edwin
>
>
> On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > Hi Paul,
> >
> > Further to my previous email, in which there was an extra "}" in the
> > configuration, I have changed to the configuration below, based on your
> > suggestion.
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">[ \t]*\r?\n</str>
> >    <str name="replacement">&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> > <processor class="solr.RegexReplaceProcessorFactory">
> >    <str name="fieldName">content</str>
> >    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >    <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the result still has more than 2 <br>. In fact, the result has
> > become worse, as you can see from the comparison below.
> >
> > Example 1: The sentence for which the regex pattern used to work correctly.
> > With the latest pattern, it has changed from 2 <br> to 5 <br>, which is
> > wrong.
> > *Original content in EML file:*
> > Dear Sir,
> >
> >
> > I am terminating
> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> > *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
> >
> > Example 2: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>)
> > *Original content in EML file:*
> >
> > *exalted*
> >
> > *Psalm 89:17*
> >
> >
> > 3 Choa Chu Kang Avenue 4
> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> > Chu Kang Avenue 4, Singapore
> > *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> > <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
> > Choa Chu Kang Avenue 4, Singapore
> >
> > Example 3: The sentence that the above regex pattern is partially working
> > (as you can see, instead of 2 <br>, there are 4 <br>). For the latest code,
> > there are now 5 <br>
> > *Original content in EML file:*
> >
> > http://www.concorded.com/
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Dec 18, 2018 at 10:07 AM
> > *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> > 10:07 AM
> > *Previous Index content: *http://www.concorded.com/   <br><br>
> > <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > *Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
> > On Tue, Dec 18, 2018 at 10:07 AM
> >
> >
> > Regards,
> > Edwin
> >
> > On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> >> Hi Paul,
> >>
> >> Thank you for the reply.
> >>
> >> I have tried to add the following configuration according to your
> >> suggestion:
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">[ \t]*\r?\n}</str>
> >>    <str name="replacement">&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> <processor class="solr.RegexReplaceProcessorFactory">
> >>    <str name="fieldName">content</str>
> >>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>    <bool name="literalReplacement">true</bool>
> >> </processor>
> >>
> >> However, none of the \n is being removed this time round.
> >> Is the order and/or the pattern correct?
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
> >>
> >>> Hi Edwin
> >>>
> >>>
> >>>
> >>> Try for the first pattern/replacement
> >>>
> >>>
> >>>
> >>> <str name="pattern">[ \t]*\r?\n</str>
> >>>
> >>> <str name="replacement">&lt;br&gt;</str>
> >>>
> >>>
> >>>
> >>> Now all line endings and preceding whitespace characters should be
> >>> changed to ‘<br>’.
> >>>
> >>> The second pattern/replacement should replace sequences of 3 or more
> >>> ‘<br>’ with 2 ‘<br>’:
> >>>
> >>>
> >>>
> >>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
> >>>
> >>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>
> >>>
> >>>
> >>> Hope this approach works. Sorry for not replying earlier and best
> >>> regards,
> >>>
> >>> Paul
> >>>
> >>>
> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> Sent: Tuesday, 5 March 2019 03:35
> >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>
> >>>
> >>>
> >>> Hi,
> >>>
> >>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Does anyone else have other suggestions, or has anyone faced the same
> >>> > problem?
> >>> >
> >>> > Regards,
> >>> > Edwin
> >>> >
> >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi Paul,
> >>> >>
> >>> >> If I execute the second step first, I only get a single <br> in the
> >>> >> places that previously had 2 <br>. In the places where we originally
> >>> >> got 4 <br>, there are now 2 <br> with a space in between.
> >>> >>
> >>> >> This just changes the 2 <br> into a single <br>, since the second step
> >>> >> replaces with a single <br>. It has not solved the underlying problem.
> >>> >>
> >>> >> Regards,
> >>> >> Edwin
> >>> >>
> >>> >>
> >>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
> >>> >>
> >>> >>> If the second step is executed first, then you will get the unwanted
> >>> >>> 4 <br>
> >>> >>>
> >>> >>>
> >>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> >>> Sent: Wednesday, 20 February 2019 09:29
> >>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> Hi Jörn ,
> >>> >>>
> >>> >>> Do you mean the regex is not correct?
> >>> >>>
> >>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
> >>> >>> ones shown below. The output that we get is still the same.
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>      <str name="fieldName">content</str>
> >>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>>      <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>>      <str name="fieldName">content</str>
> >>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
> >>> >>>      <str name="replacement">&lt;br&gt;</str>
> >>> >>>      <bool name="literalReplacement">true</bool>
> >>> >>> </processor>
> >>> >>>
> >>> >>> Regards,
> >>> >>> Edwin
> >>> >>>
> >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:
> >>> >>>
> >>> >>> > Then you need two RegexReplaceProcessorFactory steps
> >>> >>> >
> >>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > > wrote:
> >>> >>> > >
> >>> >>> > > Hi,
> >>> >>> > >
> >>> >>> > > Thanks for the reply.
> >>> >>> > >
> >>> >>> > > Do you know of any online regex tool that works correctly for Java
> >>> >>> > > regex? I tried to find some, but they are not working properly.
> >>> >>> > >
> >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
> >>> >>> > > single \n with a single <br>.
> >>> >>> > >
> >>> >>> > > Regards,
> >>> >>> > > Edwin
> >>> >>> > >
> >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com>
> >>> >>> > >> wrote:
> >>> >>> > >>
> >>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would
> >>> >>> > >> then be in the JDK. Try your solution out in an online regex tool
> >>> >>> > >> that supports Java regex.
> >>> >>> > >>
> >>> >>> > >> I believe you want to have 2 regex processor factories:
> >>> >>> > >> one that deals with a single \n and one that deals with more than
> >>> >>> > >> one \n
> >>> >>> > >>
> >>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>> >>> > >>> wrote:
> >>> >>> > >>>
> >>> >>> > >>> Hi,
> >>> >>> > >>>
> >>> >>> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, and
> >>> >>> > >>> configuration:
> >>> >>> > >>>
> >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>  <str name="fieldName">content</str>
> >>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>  <bool name="literalReplacement">true</bool>
> >>> >>> > >>> </processor>
> >>> >>> > >>>
> >>> >>> > >>> However, the issue is still occurring.
> >>> >>> > >>>
> >>> >>> > >>> Is anyone else able to help?
> >>> >>> > >>>
> >>> >>> > >>> Regards,
> >>> >>> > >>> Edwin
> >>> >>> > >>>
> >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo
> >>> >>> > >>> <edwinyeozl@gmail.com> wrote:
> >>> >>> > >>>
> >>> >>> > >>>> Hi,
> >>> >>> > >>>>
> >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>> >>> > >>>>
> >>> >>> > >>>> Regards,
> >>> >>> > >>>> Edwin
> >>> >>> > >>>>
> >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo
> >>> >>> > >>>> <edwinyeozl@gmail.com> wrote:
> >>> >>> > >>>>
> >>> >>> > >>>>> Hi,
> >>> >>> > >>>>>
> >>> >>> > >>>>> Should we report this as a bug in Solr?
> >>> >>> > >>>>>
> >>> >>> > >>>>> Regards,
> >>> >>> > >>>>> Edwin
> >>> >>> > >>>>>
> >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo
> >>> >>> > >>>>> <edwinyeozl@gmail.com> wrote:
> >>> >>> > >>>>>
> >>> >>> > >>>>>> Hi Paul,
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it
> >>> >>> > >>>>>> on https://regex101.com/, it gives us the correct result for all
> >>> >>> > >>>>>> the examples (i.e. all of them have only <br><br>, and not more
> >>> >>> > >>>>>> than that, unlike what we are getting in Solr in our earlier
> >>> >>> > >>>>>> examples).
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> Regards,
> >>> >>> > >>>>>> Edwin
> >>> >>> > >>>>>>
> >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo
> >>> >>> > >>>>>> <edwinyeozl@gmail.com> wrote:
> >>> >>> > >>>>>>
> >>> >>> > >>>>>>> Hi Paul,
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
> >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, with the following regex
> >>> >>> > >>>>>>> pattern:
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> >>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>> </processor>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> However, we are getting exactly the same results as in the
> >>> >>> > >>>>>>> earlier Examples 1, 2 and 3.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> As for your point 2, that the data may contain other
> >>> >>> > >>>>>>> (non-printing) characters than \n, we have found that there are
> >>> >>> > >>>>>>> no non-printing characters. It is just a new line with a space.
> >>> >>> > >>>>>>> You can refer to the original content in the same examples below.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Example 1: The sentence for which the above regex pattern is
> >>> >>> > >>>>>>> working correctly
> >>> >>> > >>>>>>> *Original content in EML file:*
> >>> >>> > >>>>>>> Dear Sir,
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> I am terminating
> >>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>> *Original content in EML file:*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> *exalted*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> *Psalm 89:17*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
> >>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3
> >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>> *Original content in EML file:*
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>> >>> > >>>>>>> Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> I would appreciate any other ideas or suggestions that you may have.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Thank you.
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>> Regards,
> >>> >>> > >>>>>>> Edwin
> >>> >>> > >>>>>>>
> >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Hi Edwin
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
> >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>> >>> > >>>>>>>> 2.  Perhaps the data contains other (non-printing) characters
> >>> >>> > >>>>>>>> than \n?
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Hi Paul,
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow:
> >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> >>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>>> </processor>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> But we still have exactly the same problem of Example
> 1,2
> >>> and
> >>> >>> 3
> >>> >>> > >> below.
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
> >>> >>> working
> >>> >>> > >>>>>>>> correctly
> >>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> >>> >>> terminating
> >>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
> >>> >>> partially
> >>> >>> > >>>>>>>> working
> >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>  \n\n
> >>> >>>  \n\n
> >>> >>> > 3
> >>> >>> > >>>>>>>> Choa
> >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> >>> >>> > <br><br>3
> >>> >>> > >>>>>>>> Choa
> >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
> >>> >>> partially
> >>> >>> > >>>>>>>> working
> >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
> >>>  \n\n
> >>> >>> >  \n\n
> >>> >>> > >>>>>>>> \n \n\n
> >>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> On
> >>> >>> Tue, Dec
> >>> >>> > >> 18,
> >>> >>> > >>>>>>>> 2018
> >>> >>> > >>>>>>>> at 10:07 AM
> >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
> >>>  <br><br>
> >>> >>> > >>>>>>>> <br><br>On
> >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Any further suggestion?
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Thank you.
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>> Regards,
> >>> >>> > >>>>>>>> Edwin
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch>
> >>> wrote:
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
> >>> failing
> >>> >>> on
> >>> >>> > the
> >>> >>> > >>>>>>>> {2,}
> >>> >>> > >>>>>>>>> part you could try
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> If you also want to match CRLF then
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Hi Paul,
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Thanks for your reply.
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> When I use this pattern:
> >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> >>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>>>> </processor>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> It is working for some sentence within the same content
> >>> and
> >>> >>> not
> >>> >>> > >>>>>>>> working for
> >>> >>> > >>>>>>>>> some sentences. Please see below for the one that is
> >>> working
> >>> >>> and
> >>> >>> > >>>>>>>> another
> >>> >>> > >>>>>>>>> that is not working (partially working):
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
> >>> >>> working
> >>> >>> > >>>>>>>> correctly
> >>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> >>> >>> terminating
> >>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am
> terminating
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
> >>> >>> partially
> >>> >>> > >>>>>>>> working
> >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17
>  \n\n
> >>> >>> >  \n\n  3
> >>> >>> > >>>>>>>> Choa
> >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17
>  <br><br>
> >>> >>> > <br><br>3
> >>> >>> > >>>>>>>> Choa
> >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
> >>> >>> partially
> >>> >>> > >>>>>>>> working
> >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
> >>>  \n\n
> >>> >>> > >> \n\n
> >>> >>> > >>>>>>>> \n
> >>> >>> > >>>>>>>>> \n\n
> >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> On
> >>> >>> Tue,
> >>> >>> > Dec
> >>> >>> > >>>>>>>> 18, 2018
> >>> >>> > >>>>>>>>> at 10:07 AM
> >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
> >>> >>>  <br><br>
> >>> >>> > >>>>>>>> <br><br>On
> >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Thank you.
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>> Regards,
> >>> >>> > >>>>>>>>> Edwin
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch>
> >>> wrote:
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not
> >>> working. I
> >>> >>> > assume
> >>> >>> > >>>>>>>> nothing
> >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> ??
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Hi,
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
> >>> >>> remove
> >>> >>> > more
> >>> >>> > >>>>>>>> than
> >>> >>> > >>>>>>>>> two
> >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n,
> \n
> >>> \n,
> >>> >>> \n
> >>> >>> > \n
> >>> >>> > >>>>>>>> \n
> >>> >>> > >>>>>>>>> \n),
> >>> >>> > >>>>>>>>>> and replace it with two <br>.
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> I use the following regex pattern and it is working
> >>> when I
> >>> >>> test
> >>> >>> > it
> >>> >>> > >>>>>>>> in
> >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it
> >>> inside
> >>> >>> the
> >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
> >>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> >>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> >>> > >>>>>>>>>> </processor>
> >>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
> >>> >>> instructing
> >>> >>> > the
> >>> >>> > >>>>>>>> regex
> >>> >>> > >>>>>>>>> to
> >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
> >>> instructing
> >>> >>> the
> >>> >>> > >>>>>>>> regex to
> >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should
> >>> I do
> >>> >>> it?
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>> Regards,
> >>> >>> > >>>>>>>>>> Edwin
> >>> >>> > >>>>>>>>>>
> >>> >>> > >>>>>>>>>
> >>> >>> > >>>>>>>>
> >>> >>> > >>>>>>>
> >>> >>> > >>
> >>> >>> >
> >>> >>>
> >>> >>
> >>>
> >>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
Hi Edwin



You are correct about the 2nd pattern, my mistake. Looking at the 4 <br>: is it actually the sequence «<br><br>  <br><br>», i.e. two pairs with something in between? If so, perhaps the first match pattern could be <str name="pattern">[ \t\x0b\f]*\r?\n</str>



i.e. [space tab vertical-tab formfeed]
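
The non-printing-character idea can be checked outside Solr with a few lines of plain Java, since Solr hands the pattern to java.util.regex. The following is only a sketch under stated assumptions: the class name NewlineRunDemo is made up for illustration, and the non-breaking space (U+00A0) in the second input is a guess at what might sit between the two <br><br> pairs, not something confirmed in this thread. It shows that with ordinary spaces (\n\s*){2,} collapses Example 2 into a single <br><br>, while a character that Java's default \s does not cover splits the run in two.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
class NewlineRunDemo {

    // Replace every match of regex in input with two <br> tags.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, with ordinary ASCII spaces only.
        String plain = "Psalm 89:17   \n\n   \n\n  3 Choa";
        // ASSUMPTION: same text, but a non-breaking space between the newline pairs.
        String nbsp = "Psalm 89:17   \n\n\u00A0\n\n  3 Choa";

        // One run of newlines and spaces, so one replacement.
        System.out.println(collapse("(\\n\\s*){2,}", plain));
        // Default \s does not match U+00A0, so the run is split into two matches.
        System.out.println(collapse("(\\n\\s*){2,}", nbsp));
        // Listing the character explicitly joins the run again.
        System.out.println(collapse("(\\n[\\s\\xA0]*){2,}", nbsp));
    }
}
```

If the second call reproduces the 4 <br> seen in the index, the data most likely contains a character outside the whitespace class used in the pattern; dumping the code points of the field value would show which one.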



Regards,

Paul



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Wednesday, 6 March 2019 07:44
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi Paul,

I have changed the second pattern to (&lt;br&gt;){3,} instead of
(&lt;br&gt;&lt;br&gt;){3,}. Because the group contains <br> twice, the
pattern (&lt;br&gt;&lt;br&gt;){3,} only matches 6 or more consecutive <br>,
not 3 or more. Runs of fewer than 6 <br> were therefore not replaced, which
is why we ended up with up to 5 <br> in the index.

Modified configuration:
 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
 </processor>

This brings us back to the previous index content, so the issue of having
4 <br> is still there.
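
The counting behaviour of the two quantified groups can be reproduced in plain Java. This is a minimal sketch, not the Solr processor itself: the class BrQuantifierDemo and its helper methods are made up for illustration, and the whitespace-tolerant pattern (<br>\s*){3,} in the last call is a hypothetical variation that nobody in this thread has tested against the real data.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
class BrQuantifierDemo {

    // Step 1 from the thread: each line ending, plus preceding blanks/tabs, becomes one <br>.
    static String step1(String s) {
        return Pattern.compile("[ \\t]*\\r?\\n").matcher(s).replaceAll("<br>");
    }

    // Step 2: replace every match of regex with exactly two <br>.
    static String collapse(String regex, String s) {
        return Pattern.compile(regex).matcher(s).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 1: five line endings become five <br>.
        String five = step1("Dear Sir,  \n\n \n \n\n I am terminating");
        System.out.println(five);

        // The doubled group needs 6 <br> in a row, so 5 are left untouched.
        System.out.println(collapse("(<br><br>){3,}", five));
        // The single group matches the run of 5 and collapses it to 2.
        System.out.println(collapse("(<br>){3,}", five));

        // Two pairs separated by spaces: no run of 3, so nothing is replaced.
        String gap = "<br><br>  <br><br>On Tue";
        System.out.println(collapse("(<br>){3,}", gap));
        // ASSUMPTION: allowing whitespace after each tag bridges an ordinary-space gap.
        System.out.println(collapse("(<br>\\s*){3,}", gap));
    }
}
```

Note that the last pattern also swallows whitespace that follows the final <br>, and it does not help if the gap contains a character that \s does not match.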

Regards,
Edwin




On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> Further to my previous email, in which there was an extra "}" in the
> pattern, I have now used the configuration below, based on your
> suggestion.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t]*\r?\n</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, the result that I get still has more than 2 <br>. In fact, the
> result became worse, as you can see from the comparison below.
>
> Example 1: A sentence for which the regex pattern used to work correctly.
> With the latest pattern, the output has changed from 2 <br> to 5 <br>,
> which is wrong.
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>
> Example 2: A sentence for which the above regex pattern only partially
> works (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
> Choa Chu Kang Avenue 4, Singapore
>
> Example 3: A sentence for which the above regex pattern only partially
> works (as you can see, instead of 2 <br>, there are 4 <br>). With the latest
> configuration, there are now 5 <br>.
> *Original content in EML file:*
>
> http://www.concorded.com/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> 10:07 AM
> *Previous Index content: *http://www.concorded.com/   <br><br>
> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> *Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
> On Tue, Dec 18, 2018 at 10:07 AM
>
>
> Regards,
> Edwin
>
> On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Thank you for the reply.
>>
>> I have tried to add the following configuration according to your
>> suggestion:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t]*\r?\n}</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> However, none of the \n is being removed this time round.
>> Is the order and/or the pattern correct?
>>
>> Regards,
>> Edwin
>>
>> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>>
>>> Hi Edwin
>>>
>>>
>>>
>>> Try for the first pattern/replacement
>>>
>>>
>>>
>>> <str name="pattern">[ \t]*\r?\n</str>
>>>
>>> <str name="replacement">&lt;br&gt;</str>
>>>
>>>
>>>
>>> Now all line endings and preceding whitespace characters should be
>>> changed to ‘<br>’.
>>>
>>>
>>>
>>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>>> to 2 ‘<br>’ sequences:
>>>
>>>
>>>
>>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>
>>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>
>>>
>>>
>>> Hope this approach works. Sorry for not replying earlier and best
>>> regards,
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>>
>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> Sent: Tuesday, 5 March 2019 03:35
>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi,
>>>
>>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <ed...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Does anyone else have other suggestions, or has anyone faced the same problem?
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <
>>> edwinyeozl@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Paul,
>>> >>
>>> >> If I execute the second step first, I only get a single <br> for the
>>> >> cases that previously had 2 <br>.
>>> >> For the cases where we originally got 4 <br>, there are now 2 <br> with
>>> >> a space in between.
>>> >>
>>> >> This just changes the 2 <br> into a single <br>, since the second
>>> >> step replaces each match with a single <br>.
>>> >> It has not solved the underlying problem yet.
>>> >>
>>> >> Regards,
>>> >> Edwin
>>> >>
>>> >>
>>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>> >>
>>> >>> If the second step is executed first, then you will get the unwanted
>>> 4
>>> >>> <br>
>>> >>>
>>> >>>
>>> >>>
>>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> >>> Windows 10
>>> >>>
>>> >>>
>>> >>>
>>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> Sent: Wednesday, 20 February 2019 09:29
>>> >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>>
>>> >>>
>>> >>>
>>> >>> Hi Jörn ,
>>> >>>
>>> >>> Do you mean the regex is not correct?
>>> >>>
>>> >>> We are already using two RegexReplaceProcessorFactory steps, like
>>> the one
>>> >>> shown below. The output that we get is still the same.
>>> >>>
>>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>>      <str name="fieldName">content</str>
>>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>>      <bool name="literalReplacement">true</bool>
>>> >>> </processor>
>>> >>>
>>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>>      <str name="fieldName">content</str>
>>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>> >>>      <str name="replacement">&lt;br&gt;</str>
>>> >>>      <bool name="literalReplacement">true</bool>
>>> >>> </processor>
>>> >>>
>>> >>> Regards,
>>> >>> Edwin
>>> >>>
>>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> > Then you need two regexprocessfactory steps
>>> >>> >
>>> >>> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo <
>>> >>> edwinyeozl@gmail.com
>>> >>> > >:
>>> >>> > >
>>> >>> > > Hi,
>>> >>> > >
>>> >>> > > Thanks for the reply.
>>> >>> > >
>>> >>> > > Do you know of any regex online tool that works correctly for
>>> Java
>>> >>> regex?
>>> >>> > > I tried to find some, but they are not working properly.
>>> >>> > >
>>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and
>>> >>> single \n
>>> >>> > > with single <br>.
>>> >>> > >
>>> >>> > > Regards,
>>> >>> > > Edwin
>>> >>> > >
>>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com
>>> >
>>> >>> wrote:
>>> >>> > >>
>>> >>> > >> Solr uses Java regex matching, so i doubt there is a bug - it
>>> would
>>> >>> then
>>> >>> > >> be in the JDK. Try out in a regex online Tool that supports Java
>>> >>> regex
>>> >>> > for
>>> >>> > >> your solution.
>>> >>> > >>
>>> >>> > >> I believe you want to have 2 regex process factories:
>>> >>> > >> One that deals with single \n and one that deals with more than
>>> one
>>> >>> \n
>>> >>> > >>
>>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com
>>> >>> > >>> :
>>> >>> > >>>
>>> >>> > >>> Hi,
>>> >>> > >>>
>>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> >>> > >>> configuration:
>>> >>> > >>>
>>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>  <str name="fieldName">content</str>
>>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>  <bool name="literalReplacement">true</bool>
>>> >>> > >>> </processor>
>>> >>> > >>>
>>> >>> > >>> However, the issue is still occurring.
>>> >>> > >>>
>>> >>> > >>> Is anyone else able to help?
>>> >>> > >>>
>>> >>> > >>> Regards,
>>> >>> > >>> Edwin
>>> >>> > >>>
>>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com>
>>> >>> > >>> wrote:
>>> >>> > >>>
>>> >>> > >>>> Hi,
>>> >>> > >>>>
>>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>> >>> > >>>>
>>> >>> > >>>> Regards,
>>> >>> > >>>> Edwin
>>> >>> > >>>>
>>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com
>>> >>> > >>>
>>> >>> > >>>> wrote:
>>> >>> > >>>>
>>> >>> > >>>>> Hi,
>>> >>> > >>>>>
>>> >>> > >>>>> Should we report this as a bug in Solr?
>>> >>> > >>>>>
>>> >>> > >>>>> Regards,
>>> >>> > >>>>> Edwin
>>> >>> > >>>>>
>>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com
>>> >>> > >>>
>>> >>> > >>>>> wrote:
>>> >>> > >>>>>
>>> >>> > >>>>>> Hi Paul,
>>> >>> > >>>>>>
>>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it
>>> >>> > >>>>>> on https://regex101.com/, it gives us the correct result for all
>>> >>> > >>>>>> the examples (i.e. all of them have only <br><br>, and not more
>>> >>> > >>>>>> than that, unlike what we are getting in Solr in our earlier
>>> >>> > >>>>>> examples).
>>> >>> > >>>>>>
>>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> >>> > >>>>>>
>>> >>> > >>>>>> Regards,
>>> >>> > >>>>>> Edwin
>>> >>> > >>>>>>
>>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>>> >>> > >> edwinyeozl@gmail.com>
>>> >>> > >>>>>> wrote:
>>> >>> > >>>>>>
>>> >>> > >>>>>>> Hi Paul,
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str
>>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following
>>> >>> > >>>>>>> configuration:
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>> </processor>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> However, we are still getting exactly the same results as in the
>>> >>> > >>>>>>> earlier Examples 1, 2 and 3.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> As for your point 2, that the data may contain other
>>> >>> > >>>>>>> (non-printing) characters than \n, we have found that there are
>>> >>> > >>>>>>> no non-printing characters. It is just a new line with a space.
>>> >>> > >>>>>>> You can refer to the original content in the same examples below.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is
>>> working
>>> >>> > >>>>>>> correctly
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>> Dear Sir,
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> I am terminating
>>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> terminating
>>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> *exalted*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> *Psalm 89:17*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >>>  \n\n  3
>>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >>> <br><br>3
>>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>>  \n\n
>>> >>> >  \n\n
>>> >>> > >> \n
>>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>> \n\n\n  On
>>> >>> Tue,
>>> >>> > >> Dec 18,
>>> >>> > >>>>>>> 2018 at 10:07 AM
>>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>  <br><br>
>>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may
>>> have.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Thank you.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Regards,
>>> >>> > >>>>>>> Edwin
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch>
>>> wrote:
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Hi Edwin
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
>>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
>>> >>> characters
>>> >>> > >>>>>>>> than \n?
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Hi Paul,
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow:
>>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>> </processor>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> But we still have exactly the same problem of Example 1,2
>>> and
>>> >>> 3
>>> >>> > >> below.
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
>>> >>> working
>>> >>> > >>>>>>>> correctly
>>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> >>> terminating
>>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >>>  \n\n
>>> >>> > 3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >>> > <br><br>3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>>  \n\n
>>> >>> >  \n\n
>>> >>> > >>>>>>>> \n \n\n
>>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> >>> Tue, Dec
>>> >>> > >> 18,
>>> >>> > >>>>>>>> 2018
>>> >>> > >>>>>>>> at 10:07 AM
>>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>  <br><br>
>>> >>> > >>>>>>>> <br><br>On
>>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Any further suggestion?
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Thank you.
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Regards,
>>> >>> > >>>>>>>> Edwin
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch>
>>> wrote:
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
>>> failing
>>> >>> on
>>> >>> > the
>>> >>> > >>>>>>>> {2,}
>>> >>> > >>>>>>>>> part you could try
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> If you also want to match CRLF then
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Hi Paul,
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Thanks for your reply.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> When I use this pattern:
>>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>>> </processor>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> It is working for some sentence within the same content
>>> and
>>> >>> not
>>> >>> > >>>>>>>> working for
>>> >>> > >>>>>>>>> some sentences. Please see below for the one that is
>>> working
>>> >>> and
>>> >>> > >>>>>>>> another
>>> >>> > >>>>>>>>> that is not working (partially working):
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
>>> >>> working
>>> >>> > >>>>>>>> correctly
>>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> >>> terminating
>>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >>> >  \n\n  3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >>> > <br><br>3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>>  \n\n
>>> >>> > >> \n\n
>>> >>> > >>>>>>>> \n
>>> >>> > >>>>>>>>> \n\n
>>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> >>> Tue,
>>> >>> > Dec
>>> >>> > >>>>>>>> 18, 2018
>>> >>> > >>>>>>>>> at 10:07 AM
>>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>> >>>  <br><br>
>>> >>> > >>>>>>>> <br><br>On
>>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Thank you.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Regards,
>>> >>> > >>>>>>>>> Edwin
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch>
>>> wrote:
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not
>>> working. I
>>> >>> > assume
>>> >>> > >>>>>>>> nothing
>>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> ??
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Hi,
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
>>> >>> remove
>>> >>> > more
>>> >>> > >>>>>>>> than
>>> >>> > >>>>>>>>> two
>>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n
>>> \n,
>>> >>> \n
>>> >>> > \n
>>> >>> > >>>>>>>> \n
>>> >>> > >>>>>>>>> \n),
>>> >>> > >>>>>>>>>> and replace it with two <br>.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I use the following regex pattern and it is working
>>> when I
>>> >>> test
>>> >>> > it
>>> >>> > >>>>>>>> in
>>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it
>>> inside
>>> >>> the
>>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>>>> </processor>
>>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
>>> >>> instructing
>>> >>> > the
>>> >>> > >>>>>>>> regex
>>> >>> > >>>>>>>>> to
>>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
>>> instructing
>>> >>> the
>>> >>> > >>>>>>>> regex to
>>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should
>>> I do
>>> >>> it?
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Regards,
>>> >>> > >>>>>>>>>> Edwin
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>
>>> >>> >
>>> >>>
>>> >>
>>>
>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

I have modified the second pattern to be (&lt;br&gt;){3,} instead of
(&lt;br&gt;&lt;br&gt;){3,}. Because <br> appears twice inside the group, the
pattern (&lt;br&gt;&lt;br&gt;){3,} actually looks for 6 or more <br> rather
than 3 or more. Runs of fewer than 6 <br> were therefore not replaced, which
is why we ended up with up to 5 <br> in the index.
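
The difference between the two quantified groups can be checked outside Solr.
The following is a minimal sketch in Python rather than in Java (Solr itself
uses java.util.regex); for these simple patterns the two engines are expected
to behave the same. The names collapse, five and six are made up for this
illustration, and the XML entities &lt;br&gt; are written as the literal text
<br> that they decode to.

```python
import re

# Group contains two <br>, so {3,} needs at least 3 * 2 = 6 consecutive <br>.
DOUBLED = re.compile(r"(<br><br>){3,}")
# Group contains one <br>, so {3,} needs at least 3 consecutive <br>.
SINGLE = re.compile(r"(<br>){3,}")


def collapse(pattern, text):
    """Replace every match of pattern with exactly two <br> (illustrative helper)."""
    return pattern.sub("<br><br>", text)


five = "a" + "<br>" * 5 + "b"
six = "a" + "<br>" * 6 + "b"

print(collapse(DOUBLED, five))  # five <br> is below the threshold of six
print(collapse(DOUBLED, six))
print(collapse(SINGLE, five))
```

With the doubled group, a run of five <br> is left untouched, while the
single-<br> group collapses it to two.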

Modified configuration:
 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
 </processor>

This brings us back to the previous index content, meaning that the issue of
having 4 <br> is still there.
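
The thread does not establish why some whitespace between the line breaks
survives. One possibility, which is only an assumption here, is that the
content contains whitespace that the regex does not treat as whitespace, for
example a NO-BREAK SPACE (U+00A0). By default, \s in java.util.regex matches
only [ \t\n\x0B\f\r]. The sketch below is in Python, not Java; re.ASCII is used
to restrict \s to that same set. The sample string is hypothetical (Example 2
with a U+00A0 inserted by hand), and describe_invisible is a made-up helper for
spotting such characters.

```python
import re
import unicodedata

# re.ASCII limits \s to [ \t\n\r\f\v], the same set as Java's default \s.
ASCII_WS = re.compile(r"(\n\s*){2,}", re.ASCII)
# Without re.ASCII, Python's \s also matches Unicode whitespace such as U+00A0.
UNICODE_WS = re.compile(r"(\n\s*){2,}")


def describe_invisible(text):
    """Return (index, code point, Unicode name) for every character that is
    neither printable ASCII nor one of space, tab, CR, LF."""
    found = []
    for i, ch in enumerate(text):
        if ch not in " \t\r\n" and not (32 < ord(ch) < 127):
            found.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return found


# Hypothetical input: Example 2 with a no-break space between two blank-line runs.
sample = "exalted  \n \n\n   Psalm 89:17   \n\n\u00a0\n\n  3 Choa Chu Kang Avenue 4"

print(describe_invisible(sample))
print(repr(ASCII_WS.sub("<br><br>", sample)))    # U+00A0 splits the run in two
print(repr(UNICODE_WS.sub("<br><br>", sample)))  # run is collapsed as a whole
```

If the real documents do contain such characters, this would reproduce the
"<br><br> <br><br>" shape seen in the index. In java.util.regex, the embedded
flag (?U) enables UNICODE_CHARACTER_CLASS, which makes \s Unicode-aware; whether
that resolves the issue in Solr has not been verified here.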

Regards,
Edwin




On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> Further to my previous email, in which there was an extra "}" in the
> configuration, I have changed to the configuration below, based on your
> suggestion.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t]*\r?\n</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, the result that I get still has more than 2 <br>. In fact, the
> result has become worse, as you can see from the comparison below.
>
> Example 1: The sentence for which the earlier regex pattern worked
> correctly. With the latest pattern, the result has changed from 2 <br> to
> 5 <br>, which is wrong.
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>
> Example 2: The sentence for which the above regex pattern is only partially
> working (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
> Choa Chu Kang Avenue 4, Singapore
>
> Example 3: The sentence for which the earlier regex pattern was only
> partially working (instead of 2 <br>, there were 4 <br>). With the latest
> configuration, there are now 5 <br>.
> *Original content in EML file:*
>
> http://www.concorded.com/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> 10:07 AM
> *Previous Index content: *http://www.concorded.com/   <br><br>
> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> *Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
> On Tue, Dec 18, 2018 at 10:07 AM
>
>
> Regards,
> Edwin
>
> On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Thank you for the reply.
>>
>> I have tried to add the following configuration according to your
>> suggestion:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t]*\r?\n}</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> However, none of the \n is being removed this time round.
>> Is the order and/or the pattern correct?
>>
>> Regards,
>> Edwin
>>
>> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>>
>>> Hi Edwin
>>>
>>>
>>>
>>> Try for the first pattern/replacement
>>>
>>>
>>>
>>> <str name="pattern">[ \t]*\r?\n</str>
>>>
>>> <str name="replacement">&lt;br&gt;</str>
>>>
>>>
>>>
>>> Now all line endings and preceding whitespace characters should be
>>> changed to ‘<br>’.
>>>
>>>
>>>
>>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>>> to 2 ‘<br>’ sequences:
>>>
>>>
>>>
>>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>
>>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>
>>>
>>>
>>> Hope this approach works. Sorry for not replying earlier and best
>>> regards,
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> Sent: Tuesday, 5 March 2019 03:35
>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi,
>>>
>>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <ed...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Does anyone else have suggestions, or has anyone faced the same problem?
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <
>>> edwinyeozl@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Paul,
>>> >>
>>> >> If I execute the second step first, then I only get a single <br>
>>> >> where there were 2 <br>.
>>> >> Where we originally got 4 <br>, there are now 2 <br> with a space
>>> >> in between.
>>> >>
>>> >> This just changes the 2 <br> into a single <br>, since the second
>>> >> step replaces with a single <br>.
>>> >> But it has not solved the underlying problem yet.
>>> >>
>>> >> Regards,
>>> >> Edwin
>>> >>
>>> >>
>>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>> >>
>>> >>> If the second step is executed first, then you will get the unwanted
>>> 4
>>> >>> <br>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> Sent: Wednesday, 20 February 2019 09:29
>>> >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>>
>>> >>>
>>> >>>
>>> >>> Hi Jörn ,
>>> >>>
>>> >>> Do you mean the regex is not correct?
>>> >>>
>>> >>> We are already using two RegexReplaceProcessorFactory steps, like
>>> the one
>>> >>> shown below. The output that we get is still the same.
>>> >>>
>>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>>      <str name="fieldName">content</str>
>>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>>      <bool name="literalReplacement">true</bool>
>>> >>> </processor>
>>> >>>
>>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>>      <str name="fieldName">content</str>
>>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>> >>>      <str name="replacement">&lt;br&gt;</str>
>>> >>>      <bool name="literalReplacement">true</bool>
>>> >>> </processor>
>>> >>>
>>> >>> Regards,
>>> >>> Edwin
>>> >>>
>>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> > Then you need two regexprocessfactory steps
>>> >>> >
>>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> >>> > >
>>> >>> > > Hi,
>>> >>> > >
>>> >>> > > Thanks for the reply.
>>> >>> > >
>>> >>> > > Do you know of any regex online tool that works correctly for
>>> Java
>>> >>> regex?
>>> >>> > > I tried to find some, but they are not working properly.
>>> >>> > >
>>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and
>>> >>> single \n
>>> >>> > > with single <br>.
>>> >>> > >
>>> >>> > > Regards,
>>> >>> > > Edwin
>>> >>> > >
>>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfranke@gmail.com
>>> >
>>> >>> wrote:
>>> >>> > >>
>>> >>> > >> Solr uses Java regex matching, so i doubt there is a bug - it
>>> would
>>> >>> then
>>> >>> > >> be in the JDK. Try out in a regex online Tool that supports Java
>>> >>> regex
>>> >>> > for
>>> >>> > >> your solution.
>>> >>> > >>
>>> >>> > >> I believe you want to have 2 regex process factories:
>>> >>> > >> One that deals with single \n and one that deals with more than
>>> one
>>> >>> \n
>>> >>> > >>
>>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> >>> > >>>
>>> >>> > >>> Hi,
>>> >>> > >>>
>>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> >>> > >>> configuration:
>>> >>> > >>>
>>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>  <str name="fieldName">content</str>
>>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>  <bool name="literalReplacement">true</bool>
>>> >>> > >>> </processor>
>>> >>> > >>>
>>> >>> > >>> However, the issue is still occurring.
>>> >>> > >>>
>>> >>> > >>> Is anyone else able to help?
>>> >>> > >>>
>>> >>> > >>> Regards,
>>> >>> > >>> Edwin
>>> >>> > >>>
>>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com>
>>> >>> > >>> wrote:
>>> >>> > >>>
>>> >>> > >>>> Hi,
>>> >>> > >>>>
>>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>> >>> > >>>>
>>> >>> > >>>> Regards,
>>> >>> > >>>> Edwin
>>> >>> > >>>>
>>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com
>>> >>> > >>>
>>> >>> > >>>> wrote:
>>> >>> > >>>>
>>> >>> > >>>>> Hi,
>>> >>> > >>>>>
>>> >>> > >>>>> Should we report this as a bug in Solr?
>>> >>> > >>>>>
>>> >>> > >>>>> Regards,
>>> >>> > >>>>> Edwin
>>> >>> > >>>>>
>>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>>> >>> > edwinyeozl@gmail.com
>>> >>> > >>>
>>> >>> > >>>>> wrote:
>>> >>> > >>>>>
>>> >>> > >>>>>> Hi Paul,
>>> >>> > >>>>>>
>>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we
>>> >>> > >>>>>> try it on https://regex101.com/, it gives us the correct
>>> >>> > >>>>>> result for all the examples (i.e. all of them have only
>>> >>> > >>>>>> <br><br>, and not more than that, unlike what we are getting
>>> >>> > >>>>>> in Solr in our earlier examples).
>>> >>> > >>>>>>
>>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> >>> > >>>>>>
>>> >>> > >>>>>> Regards,
>>> >>> > >>>>>> Edwin
>>> >>> > >>>>>>
>>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>>> >>> > >> edwinyeozl@gmail.com>
>>> >>> > >>>>>> wrote:
>>> >>> > >>>>>>
>>> >>> > >>>>>>> Hi Paul,
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str
>>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following
>>> >>> > >>>>>>> configuration:
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>> </processor>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> However, we are also getting the exact same results as the
>>> >>> earlier
>>> >>> > >>>>>>> Example 1, 2 and 3.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> As for your point 2, that the data may contain other
>>> >>> > >>>>>>> (non-printing) characters than \n, we have found that there
>>> >>> > >>>>>>> are no non-printing characters. It is just a new line with a
>>> >>> > >>>>>>> space. You can refer to the original content in the same
>>> >>> > >>>>>>> examples below.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is
>>> working
>>> >>> > >>>>>>> correctly
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>> Dear Sir,
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> I am terminating
>>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> terminating
>>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> *exalted*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> *Psalm 89:17*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >>>  \n\n  3
>>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >>> <br><br>3
>>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
>>> <br>)
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>>  \n\n
>>> >>> >  \n\n
>>> >>> > >> \n
>>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n
>>> \n\n\n  On
>>> >>> Tue,
>>> >>> > >> Dec 18,
>>> >>> > >>>>>>> 2018 at 10:07 AM
>>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>  <br><br>
>>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may
>>> have.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Thank you.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Regards,
>>> >>> > >>>>>>> Edwin
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch>
>>> wrote:
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Hi Edwin
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede
>>> >>> > >>>>>>>> the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
>>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing)
>>> >>> > >>>>>>>> characters than \n?
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Hi Paul,
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow:
>>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>> </processor>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> But we still have exactly the same problem of Example 1,2
>>> and
>>> >>> 3
>>> >>> > >> below.
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
>>> >>> working
>>> >>> > >>>>>>>> correctly
>>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> >>> terminating
>>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >>>  \n\n
>>> >>> > 3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >>> > <br><br>3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>>  \n\n
>>> >>> >  \n\n
>>> >>> > >>>>>>>> \n \n\n
>>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> >>> Tue, Dec
>>> >>> > >> 18,
>>> >>> > >>>>>>>> 2018
>>> >>> > >>>>>>>> at 10:07 AM
>>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>  <br><br>
>>> >>> > >>>>>>>> <br><br>On
>>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Any further suggestion?
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Thank you.
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Regards,
>>> >>> > >>>>>>>> Edwin
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch>
>>> wrote:
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
>>> failing
>>> >>> on
>>> >>> > the
>>> >>> > >>>>>>>> {2,}
>>> >>> > >>>>>>>>> part you could try
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> If you also want to match CRLF then
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Hi Paul,
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Thanks for your reply.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> When I use this pattern:
>>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>>> </processor>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> It is working for some sentence within the same content
>>> and
>>> >>> not
>>> >>> > >>>>>>>> working for
>>> >>> > >>>>>>>>> some sentences. Please see below for the one that is
>>> working
>>> >>> and
>>> >>> > >>>>>>>> another
>>> >>> > >>>>>>>>> that is not working (partially working):
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
>>> >>> working
>>> >>> > >>>>>>>> correctly
>>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> >>> terminating
>>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >>> >  \n\n  3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> >>> > <br><br>3
>>> >>> > >>>>>>>> Choa
>>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
>>> >>> partially
>>> >>> > >>>>>>>> working
>>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>>  \n\n
>>> >>> > >> \n\n
>>> >>> > >>>>>>>> \n
>>> >>> > >>>>>>>>> \n\n
>>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> >>> Tue,
>>> >>> > Dec
>>> >>> > >>>>>>>> 18, 2018
>>> >>> > >>>>>>>>> at 10:07 AM
>>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>> >>>  <br><br>
>>> >>> > >>>>>>>> <br><br>On
>>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Thank you.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Regards,
>>> >>> > >>>>>>>>> Edwin
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch>
>>> wrote:
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not
>>> working. I
>>> >>> > assume
>>> >>> > >>>>>>>> nothing
>>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> ??
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Hi,
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
>>> >>> remove
>>> >>> > more
>>> >>> > >>>>>>>> than
>>> >>> > >>>>>>>>> two
>>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n
>>> \n,
>>> >>> \n
>>> >>> > \n
>>> >>> > >>>>>>>> \n
>>> >>> > >>>>>>>>> \n),
>>> >>> > >>>>>>>>>> and replace it with two <br>.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I use the following regex pattern and it is working
>>> when I
>>> >>> test
>>> >>> > it
>>> >>> > >>>>>>>> in
>>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it
>>> inside
>>> >>> the
>>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>>>> </processor>
>>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
>>> >>> instructing
>>> >>> > the
>>> >>> > >>>>>>>> regex
>>> >>> > >>>>>>>>> to
>>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
>>> instructing
>>> >>> the
>>> >>> > >>>>>>>> regex to
>>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should
>>> I do
>>> >>> it?
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Regards,
>>> >>> > >>>>>>>>>> Edwin
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>
>>> >>> >
>>> >>>
>>> >>
>>>
>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Further to my previous email, in which there was an extra "}" in the
configuration, I have changed to the configuration below, based on your
suggestion.

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t]*\r?\n</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, the result that I get still has more than 2 <br>. In fact, the
result has become worse, as you can see from the comparison below.

Example 1: The sentence for which the earlier regex pattern worked correctly.
With the latest pattern, the result has changed from 2 <br> to 5 <br>, which
is wrong.
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Previous Index content: *    Dear Sir,  <br><br>I am terminating
*Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
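
The 5 <br> in Example 1 can be reproduced outside Solr by applying the two
configured patterns in sequence. This is a minimal sketch in Python rather
than Java, assuming the content contains only plain spaces and \n exactly as
shown in "Original content" (without the leading spaces) and that each
processor replaces all matches. The names run_chain and EXAMPLE_1 are made up
for this illustration.

```python
import re

# First processor: optional spaces/tabs followed by a line ending -> one <br>.
STEP1 = re.compile(r"[ \t]*\r?\n")
# Second processor as configured: the group holds two <br>, repeated 3+ times.
STEP2 = re.compile(r"(<br><br>){3,}")

EXAMPLE_1 = "Dear Sir,  \n\n \n \n\n I am terminating"


def run_chain(text):
    """Apply the two replacements in the same order as the processor chain."""
    after_step1 = STEP1.sub("<br>", text)
    after_step2 = STEP2.sub("<br><br>", after_step1)
    return after_step1, after_step2


step1_out, step2_out = run_chain(EXAMPLE_1)
print(step1_out)               # five line endings become five <br>
print(step1_out.count("<br>"))
print(step2_out)               # unchanged, the second pattern needs six <br>
```

Under these assumptions, the first step turns the five line endings into five
consecutive <br>, and the second step leaves them alone because its pattern
only matches runs of six or more.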

Example 2: The sentence for which the above regex pattern is only partially
working (as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
<br><br>3 Choa Chu Kang Avenue 4, Singapore
*Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
Choa Chu Kang Avenue 4, Singapore

Example 3: The sentence for which the earlier regex pattern was only partially
working (instead of 2 <br>, there were 4 <br>). With the latest configuration,
there are now 5 <br>.
*Original content in EML file:*

http://www.concorded.com/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
10:07 AM
*Previous Index content: *http://www.concorded.com/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM
*Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
On Tue, Dec 18, 2018 at 10:07 AM
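
In Examples 2 and 3 some blanks survive between the <br> groups, although
[ \t]* should consume ordinary blanks that precede a line ending. One possible
explanation, which is purely an assumption and is not confirmed anywhere in
this thread, is that those characters are non-breaking spaces (U+00A0) rather
than ordinary spaces. In java.util.regex neither [ \t] nor the default \s
matches U+00A0, whereas \h (available since Java 8) does. The sketch below uses
a made-up input and hypothetical names just to show that difference.

```java
// Hypothetical helper, not part of Solr: compares [ \t] with \h on non-breaking spaces.
final class NbspSketch {
    // Pattern as configured in this thread: only space and tab count as blanks.
    static String withAsciiBlanks(String s) {
        return s.replaceAll("[ \\t]*\\r?\\n", "<br>");
    }

    // Assumed variant: \h also covers U+00A0 and other horizontal whitespace.
    static String withHorizontalWhitespace(String s) {
        return s.replaceAll("\\h*\\r?\\n", "<br>");
    }

    public static void main(String[] args) {
        // Made-up input: two line feeds, two non-breaking spaces, two line feeds.
        String sample = "Psalm 89:17\n\n\u00A0\u00A0\n\n3 Choa";
        // The two U+00A0 characters survive between the <br> pairs.
        System.out.println(withAsciiBlanks(sample));
        // The U+00A0 characters are consumed, giving four consecutive <br>.
        System.out.println(withHorizontalWhitespace(sample));
    }
}
```

If the indexed field really does contain U+00A0, a later step that collapses
runs of <br> could only work once those characters are consumed as well.
Whether that is the case here would have to be checked against the raw field
value.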


Regards,
Edwin

On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> Thank you for the reply.
>
> I have tried to add the following configuration according to your
> suggestion:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t]*\r?\n}</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, none of the \n is being removed this time round.
> Is the order and/or the pattern correct?
>
> Regards,
> Edwin
>
> On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:
>
>> Hi Edwin
>>
>>
>>
>> Try for the first pattern/replacement
>>
>>
>>
>> <str name="pattern">[ \t]*\r?\n</str>
>>
>> <str name="replacement">&lt;br&gt;</str>
>>
>>
>>
>> Now all line endings and preceding whitespace characters should be
>> changed to ‘<br>’.
>>
>>
>>
>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>> to 2 ‘<br>’ sequences:
>>
>>
>>
>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>
>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>
>>
>>
>> Hope this approach works. Sorry for not replying earlier and best regards,
>>
>> Paul
>>
>>
>>
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> Sent: Tuesday, 5 March 2019 03:35
>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi,
>>
>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>
>> Regards,
>> Edwin
>>
>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Does anyone else have other suggestions, or has anyone faced the same problem?
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
>> >
>> > wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> If I tried to execute the second step first, then I will only get a
>> >> single <br> for those with 2 <br>.
>> >> For those that we originally get 4 <br>, there will be 2 <br> with a
>> >> space in between.
>> >>
>> >> This is just changing the 2 <br> to be a single <br>, since the second
>> >> step is to replace with a single <br>.
>> >> But it has not solved the underlying problem yet.
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>> >>
>> >>> If the second step is executed first, then you will get the unwanted 4
>> >>> <br>
>> >>>
>> >>>
>> >>>
>> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> Windows 10
>> >>>
>> >>>
>> >>>
>> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>> Sent: Wednesday, 20 February 2019 09:29
>> >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>>
>> >>>
>> >>>
>> >>> Hi Jörn ,
>> >>>
>> >>> Do you mean the regex is not correct?
>> >>>
>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
>> one
>> >>> shown below. The output that we get is still the same.
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>      <str name="fieldName">content</str>
>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>>      <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>      <str name="fieldName">content</str>
>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>> >>>      <str name="replacement">&lt;br&gt;</str>
>> >>>      <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com>
>> wrote:
>> >>>
>> >>> > Then you need two regexprocessfactory steps
>> >>> >
>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>> > >
>> >>> > > Hi,
>> >>> > >
>> >>> > > Thanks for the reply.
>> >>> > >
>> >>> > > Do you know of any regex online tool that works correctly for Java
>> >>> regex?
>> >>> > > I tried to find some, but they are not working properly.
>> >>> > >
>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and
>> >>> single \n
>> >>> > > with single <br>.
>> >>> > >
>> >>> > > Regards,
>> >>> > > Edwin
>> >>> > >
>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com>
>> >>> wrote:
>> >>> > >>
>> >>> > >> Solr uses Java regex matching, so i doubt there is a bug - it
>> would
>> >>> then
>> >>> > >> be in the JDK. Try out in a regex online Tool that supports Java
>> >>> regex
>> >>> > for
>> >>> > >> your solution.
>> >>> > >>
>> >>> > >> I believe you want to have 2 regex process factories:
>> >>> > >> One that deals with single \n and one that deals with more than
>> one
>> >>> \n
>> >>> > >>
>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> >>> > >>>
>> >>> > >>> Hi,
>> >>> > >>>
>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>> >>> > >>> configuration:
>> >>> > >>>
>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>  <str name="fieldName">content</str>
>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>  <bool name="literalReplacement">true</bool>
>> >>> > >>> </processor>
>> >>> > >>>
>> >>> > >>> However, the issue is still occurring.
>> >>> > >>>
>> >>> > >>> Is anyone else able to help?
>> >>> > >>>
>> >>> > >>> Regards,
>> >>> > >>> Edwin
>> >>> > >>>
>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>> >>> > edwinyeozl@gmail.com>
>> >>> > >>> wrote:
>> >>> > >>>
>> >>> > >>>> Hi,
>> >>> > >>>>
>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> >>> > >>>>
>> >>> > >>>> Regards,
>> >>> > >>>> Edwin
>> >>> > >>>>
>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>> >>> > edwinyeozl@gmail.com
>> >>> > >>>
>> >>> > >>>> wrote:
>> >>> > >>>>
>> >>> > >>>>> Hi,
>> >>> > >>>>>
>> >>> > >>>>> Should we report this as a bug in Solr?
>> >>> > >>>>>
>> >>> > >>>>> Regards,
>> >>> > >>>>> Edwin
>> >>> > >>>>>
>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>> >>> > edwinyeozl@gmail.com
>> >>> > >>>
>> >>> > >>>>> wrote:
>> >>> > >>>>>
>> >>> > >>>>>> Hi Paul,
>> >>> > >>>>>>
>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all
>> >>> > >>>>>> the examples (i.e. all of them have only <br><br>, and not more than
>> >>> > >>>>>> that, unlike what we are getting in Solr in our earlier examples).
>> >>> > >>>>>>
>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>> >>> > >>>>>>
>> >>> > >>>>>> Regards,
>> >>> > >>>>>> Edwin
>> >>> > >>>>>>
>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>> >>> > >> edwinyeozl@gmail.com>
>> >>> > >>>>>> wrote:
>> >>> > >>>>>>
>> >>> > >>>>>>> Hi Paul,
>> >>> > >>>>>>>
>> >>> > >>>>>>> We have tried it with the space preceding the \n i.e. <str
>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex
>> >>> pattern:
>> >>> > >>>>>>>
>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>> </processor>
>> >>> > >>>>>>>
>> >>> > >>>>>>> However, we are also getting the exact same results as the
>> >>> earlier
>> >>> > >>>>>>> Example 1, 2 and 3.
>> >>> > >>>>>>>
>> >>> > >>>>>>> As for your point 2 on perhaps in the data you have other
>> (non
>> >>> > >>>>>>> printing) characters than \n, we have found that there are no
>> >>> non
>> >>> > >> printing
>> >>> > >>>>>>> characters. It is just next line with a space. You can refer
>> >>> to the
>> >>> > >>>>>>> original content in the same examples below.
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is
>> working
>> >>> > >>>>>>> correctly
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>> Dear Sir,
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> I am terminating
>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> terminating
>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is
>> >>> partially
>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
>> <br>)
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>>
>> >>> > >>>>>>> *exalted*
>> >>> > >>>>>>>
>> >>> > >>>>>>> *Psalm 89:17*
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>> >>>  \n\n  3
>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>> <br><br>3
>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>
>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is
>> >>> partially
>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4
>> <br>)
>> >>> > >>>>>>> *Original content in EML file:*
>> >>> > >>>>>>>
>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>  \n\n
>> >>> >  \n\n
>> >>> > >> \n
>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
>> On
>> >>> Tue,
>> >>> > >> Dec 18,
>> >>> > >>>>>>> 2018 at 10:07 AM
>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>  <br><br>
>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>
>> >>> > >>>>>>>
>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Thank you.
>> >>> > >>>>>>>
>> >>> > >>>>>>> Regards,
>> >>> > >>>>>>> Edwin
>> >>> > >>>>>>>
>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch>
>> wrote:
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Hi Edwin
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede
>> >>> the \n
>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
>> >>> characters
>> >>> > >>>>>>>> than \n?
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>> Windows 10
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Hi Paul,
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>> </processor>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> But we still have exactly the same problem of Example 1,2
>> and
>> >>> 3
>> >>> > >> below.
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
>> >>> working
>> >>> > >>>>>>>> correctly
>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> >>> terminating
>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>> >>>  \n\n
>> >>> > 3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>> > <br><br>3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>  \n\n
>> >>> >  \n\n
>> >>> > >>>>>>>> \n \n\n
>> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>> >>> Tue, Dec
>> >>> > >> 18,
>> >>> > >>>>>>>> 2018
>> >>> > >>>>>>>> at 10:07 AM
>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>  <br><br>
>> >>> > >>>>>>>> <br><br>On
>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Any further suggestion?
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Thank you.
>> >>> > >>>>>>>>
>> >>> > >>>>>>>> Regards,
>> >>> > >>>>>>>> Edwin
>> >>> > >>>>>>>>
>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch>
>> wrote:
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then
>> failing
>> >>> on
>> >>> > the
>> >>> > >>>>>>>> {2,}
>> >>> > >>>>>>>>> part you could try
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> If you also want to match CRLF then
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>>> Windows 10
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Hi Paul,
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Thanks for your reply.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> When I use this pattern:
>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>>> </processor>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> It is working for some sentence within the same content
>> and
>> >>> not
>> >>> > >>>>>>>> working for
>> >>> > >>>>>>>>> some sentences. Please see below for the one that is
>> working
>> >>> and
>> >>> > >>>>>>>> another
>> >>> > >>>>>>>>> that is not working (partially working):
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
>> >>> working
>> >>> > >>>>>>>> correctly
>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> >>> terminating
>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>> >>> >  \n\n  3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> >>> > <br><br>3
>> >>> > >>>>>>>> Choa
>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
>> >>> partially
>> >>> > >>>>>>>> working
>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>>  \n\n
>> >>> > >> \n\n
>> >>> > >>>>>>>> \n
>> >>> > >>>>>>>>> \n\n
>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>> >>> Tue,
>> >>> > Dec
>> >>> > >>>>>>>> 18, 2018
>> >>> > >>>>>>>>> at 10:07 AM
>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>> >>>  <br><br>
>> >>> > >>>>>>>> <br><br>On
>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Thank you.
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>> Regards,
>> >>> > >>>>>>>>> Edwin
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch>
>> wrote:
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working.
>> I
>> >>> > assume
>> >>> > >>>>>>>> nothing
>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> ??
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> >>> > >>>>>>>>>> Windows 10
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Hi,
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
>> >>> remove
>> >>> > more
>> >>> > >>>>>>>> than
>> >>> > >>>>>>>>> two
>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n
>> \n,
>> >>> \n
>> >>> > \n
>> >>> > >>>>>>>> \n
>> >>> > >>>>>>>>> \n),
>> >>> > >>>>>>>>>> and replace it with two <br>.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I use the following regex pattern and it is working when
>> I
>> >>> test
>> >>> > it
>> >>> > >>>>>>>> in
>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside
>> >>> the
>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> >>> > >>>>>>>>>> </processor>
>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
>> >>> instructing
>> >>> > the
>> >>> > >>>>>>>> regex
>> >>> > >>>>>>>>> to
>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is
>> instructing
>> >>> the
>> >>> > >>>>>>>> regex to
>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should I
>> do
>> >>> it?
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>> Regards,
>> >>> > >>>>>>>>>> Edwin
>> >>> > >>>>>>>>>>
>> >>> > >>>>>>>>>
>> >>> > >>>>>>>>
>> >>> > >>>>>>>
>> >>> > >>
>> >>> >
>> >>>
>> >>
>>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Thank you for the reply.

I have tried to add the following configuration according to your
suggestion:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">[ \t]*\r?\n}</str>
   <str name="replacement">&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, none of the \n are being removed this time.
Is the order and/or the pattern correct?
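
A likely reason nothing is replaced is the trailing "}" in the first pattern.
java.util.regex treats a dangling "}" as a literal character, so
[ \t]*\r?\n} only matches a line ending that is immediately followed by a "}"
in the text. The sketch below, with hypothetical names and made-up sample
strings, shows the difference; it says nothing about whether the order of the
two processors is right.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: shows how a dangling "}" changes the match.
final class DanglingBraceSketch {
    static String apply(String regex, String input) {
        // "<br>" contains no '$' or '\', so it is safe as a replacement string.
        return Pattern.compile(regex).matcher(input).replaceAll("<br>");
    }

    public static void main(String[] args) {
        String withBrace = "[ \\t]*\\r?\\n}";   // pattern with the extra "}"
        String withoutBrace = "[ \\t]*\\r?\\n"; // pattern as suggested

        // No "}" follows the line feed, so the text is left unchanged.
        System.out.println(apply(withBrace, "a \nb").equals("a \nb"));
        // Without the brace the blank and the line feed become <br>.
        System.out.println(apply(withoutBrace, "a \nb"));
        // The braced pattern only matches when a literal "}" follows the line feed.
        System.out.println(apply(withBrace, "a \n}b"));
    }
}
```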

Regards,
Edwin

On Tue, 5 Mar 2019 at 19:54, <pa...@ub.unibe.ch> wrote:

> Hi Edwin
>
>
>
> Try for the first pattern/replacement
>
>
>
> <str name="pattern">[ \t]*\r?\n</str>
>
> <str name="replacement">&lt;br&gt;</str>
>
>
>
> Now all line endings and preceding whitespace characters should be changed
> to ‘<br>’.
>
>
>
> The second pattern replacement should replace 3 or more ‘<br>’ sequences
> to 2 ‘<br>’ sequences:
>
>
>
> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>
> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>
>
>
> Hope this approach works. Sorry for not replying earlier and best regards,
>
> Paul
>
>
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> Sent: Tuesday, 5 March 2019 03:35
> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi,
>
> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>
> Regards,
> Edwin
>
> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Does anyone else have other suggestions, or has anyone faced the same problem?
> >
> > Regards,
> > Edwin
> >
> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> >> Hi Paul,
> >>
> >> If I tried to execute the second step first, then I will only get a
> >> single <br> for those with 2 <br>.
> >> For those that we originally get 4 <br>, there will be 2 <br> with a
> >> space in between.
> >>
> >> This is just changing the 2 <br> to be a single <br>, since the second
> >> step is to replace with a single <br>.
> >> But it has not solved the underlying problem yet.
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
> >>
> >>> If the second step is executed first, then you will get the unwanted 4
> >>> <br>
> >>>
> >>>
> >>>
> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>> Windows 10
> >>>
> >>>
> >>>
> >>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> Sent: Wednesday, 20 February 2019 09:29
> >>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>
> >>>
> >>>
> >>> Hi Jörn ,
> >>>
> >>> Do you mean the regex is not correct?
> >>>
> >>> We are already using two RegexReplaceProcessorFactory steps, like the
> one
> >>> shown below. The output that we get is still the same.
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>      <str name="fieldName">content</str>
> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>      <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>      <str name="fieldName">content</str>
> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
> >>>      <str name="replacement">&lt;br&gt;</str>
> >>>      <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com>
> wrote:
> >>>
> >>> > Then you need two regexprocessfactory steps
> >>> >
> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>> > >
> >>> > > Hi,
> >>> > >
> >>> > > Thanks for the reply.
> >>> > >
> >>> > > Do you know of any regex online tool that works correctly for Java
> >>> regex?
> >>> > > I tried to find some, but they are not working properly.
> >>> > >
> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and
> >>> single \n
> >>> > > with single <br>.
> >>> > >
> >>> > > Regards,
> >>> > > Edwin
> >>> > >
> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com>
> >>> wrote:
> >>> > >>
> >>> > >> Solr uses Java regex matching, so i doubt there is a bug - it
> would
> >>> then
> >>> > >> be in the JDK. Try out in a regex online Tool that supports Java
> >>> regex
> >>> > for
> >>> > >> your solution.
> >>> > >>
> >>> > >> I believe you want to have 2 regex process factories:
> >>> > >> One that deals with single \n and one that deals with more than
> one
> >>> \n
> >>> > >>
> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> >>> > >>>
> >>> > >>> Hi,
> >>> > >>>
> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> >>> > >>> configuration:
> >>> > >>>
> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> > >>>  <str name="fieldName">content</str>
> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> > >>>  <bool name="literalReplacement">true</bool>
> >>> > >>> </processor>
> >>> > >>>
> >>> > >>> However, the issue is still occurring.
> >>> > >>>
> >>> > >>> Is anyone else able to help?
> >>> > >>>
> >>> > >>> Regards,
> >>> > >>> Edwin
> >>> > >>>
> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> >>> > edwinyeozl@gmail.com>
> >>> > >>> wrote:
> >>> > >>>
> >>> > >>>> Hi,
> >>> > >>>>
> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>> > >>>>
> >>> > >>>> Regards,
> >>> > >>>> Edwin
> >>> > >>>>
> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> >>> > edwinyeozl@gmail.com
> >>> > >>>
> >>> > >>>> wrote:
> >>> > >>>>
> >>> > >>>>> Hi,
> >>> > >>>>>
> >>> > >>>>> Should we report this as a bug in Solr?
> >>> > >>>>>
> >>> > >>>>> Regards,
> >>> > >>>>> Edwin
> >>> > >>>>>
> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> >>> > edwinyeozl@gmail.com
> >>> > >>>
> >>> > >>>>> wrote:
> >>> > >>>>>
> >>> > >>>>>> Hi Paul,
> >>> > >>>>>>
> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all
> >>> > >>>>>> the examples (i.e. all of them have only <br><br>, and not more than
> >>> > >>>>>> that, unlike what we are getting in Solr in our earlier examples).
> >>> > >>>>>>
> >>> > >>>>>> Could there be a possibility of a bug in Solr?
> >>> > >>>>>>
> >>> > >>>>>> Regards,
> >>> > >>>>>> Edwin
> >>> > >>>>>>
> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> >>> > >> edwinyeozl@gmail.com>
> >>> > >>>>>> wrote:
> >>> > >>>>>>
> >>> > >>>>>>> Hi Paul,
> >>> > >>>>>>>
> >>> > >>>>>>> We have tried it with the space preceding the \n i.e. <str
> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex
> >>> pattern:
> >>> > >>>>>>>
> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> > >>>>>>>  <str name="fieldName">content</str>
> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> > >>>>>>> </processor>
> >>> > >>>>>>>
> >>> > >>>>>>> However, we are also getting the exact same results as the
> >>> earlier
> >>> > >>>>>>> Example 1, 2 and 3.
> >>> > >>>>>>>
> >>> > >>>>>>> As for your point 2 on perhaps in the data you have other
> (non
> >>> > >>>>>>> printing) characters than \n, we have found that there are no
> >>> non
> >>> > >> printing
> >>> > >>>>>>> characters. It is just next line with a space. You can refer
> >>> to the
> >>> > >>>>>>> original content in the same examples below.
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is
> working
> >>> > >>>>>>> correctly
> >>> > >>>>>>> *Original content in EML file:*
> >>> > >>>>>>> Dear Sir,
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>> I am terminating
> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> terminating
> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> > >>>>>>>
> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is
> >>> partially
> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> > >>>>>>> *Original content in EML file:*
> >>> > >>>>>>>
> >>> > >>>>>>> *exalted*
> >>> > >>>>>>>
> >>> > >>>>>>> *Psalm 89:17*
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
> >>>  \n\n  3
> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> >>> <br><br>3
> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>> > >>>>>>>
> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is
> >>> partially
> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> > >>>>>>> *Original content in EML file:*
> >>> > >>>>>>>
> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
> >>> >  \n\n
> >>> > >> \n
> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n
> On
> >>> Tue,
> >>> > >> Dec 18,
> >>> > >>>>>>> 2018 at 10:07 AM
> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>  <br><br>
> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
> >>> > >>>>>>>
> >>> > >>>>>>> Thank you.
> >>> > >>>>>>>
> >>> > >>>>>>> Regards,
> >>> > >>>>>>> Edwin
> >>> > >>>>>>>
> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hi Edwin
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede
> >>> the \n
> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
> >>> characters
> >>> > >>>>>>>> than \n?
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:
> >>> > solr-user@lucene.apache.org>
> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
> >>> > >> multiple \n
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hi Paul,
> >>> > >>>>>>>>
> >>> > >>>>>>>> We have tried this suggested regex pattern as follows:
> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> > >>>>>>>>  <str name="fieldName">content</str>
> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> > >>>>>>>> </processor>
> >>> > >>>>>>>>
> >>> > >>>>>>>> But we still have exactly the same problem of Example 1,2
> and
> >>> 3
> >>> > >> below.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
> >>> working
> >>> > >>>>>>>> correctly
> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> >>> terminating
> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> > >>>>>>>>
> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
> >>> partially
> >>> > >>>>>>>> working
> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
> >>>  \n\n
> >>> > 3
> >>> > >>>>>>>> Choa
> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> >>> > <br><br>3
> >>> > >>>>>>>> Choa
> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore
> >>> > >>>>>>>>
> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
> >>> partially
> >>> > >>>>>>>> working
> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>  \n\n
> >>> >  \n\n
> >>> > >>>>>>>> \n \n\n
> >>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
> >>> Tue, Dec
> >>> > >> 18,
> >>> > >>>>>>>> 2018
> >>> > >>>>>>>> at 10:07 AM
> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>  <br><br>
> >>> > >>>>>>>> <br><br>On
> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>> > >>>>>>>>
> >>> > >>>>>>>> Any further suggestion?
> >>> > >>>>>>>>
> >>> > >>>>>>>> Thank you.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Regards,
> >>> > >>>>>>>> Edwin
> >>> > >>>>>>>>
> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch>
> wrote:
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing
> >>> on
> >>> > the
> >>> > >>>>>>>> {2,}
> >>> > >>>>>>>>> part you could try
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> If you also want to match CRLF then
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:
> >>> > solr-user@lucene.apache.org
> >>> > >>>
> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
> >>> > >> multiple
> >>> > >>>>>>>> \n
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Hi Paul,
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Thanks for your reply.
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> When I use this pattern:
> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> > >>>>>>>>>  <str name="fieldName">content</str>
> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> > >>>>>>>>> </processor>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> It is working for some sentence within the same content and
> >>> not
> >>> > >>>>>>>> working for
> >>> > >>>>>>>>> some sentences. Please see below for the one that is
> working
> >>> and
> >>> > >>>>>>>> another
> >>> > >>>>>>>>> that is not working (partially working):
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
> >>> working
> >>> > >>>>>>>> correctly
> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
> >>> terminating
> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
> >>> partially
> >>> > >>>>>>>> working
> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
> >>> >  \n\n  3
> >>> > >>>>>>>> Choa
> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> >>> > <br><br>3
> >>> > >>>>>>>> Choa
> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
> >>> partially
> >>> > >>>>>>>> working
> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/
>  \n\n
> >>> > >> \n\n
> >>> > >>>>>>>> \n
> >>> > >>>>>>>>> \n\n
> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
> >>> Tue,
> >>> > Dec
> >>> > >>>>>>>> 18, 2018
> >>> > >>>>>>>>> at 10:07 AM
> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
> >>>  <br><br>
> >>> > >>>>>>>> <br><br>On
> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> We would appreciate your help to see what is wrong?
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Thank you.
> >>> > >>>>>>>>>
> >>> > >>>>>>>>> Regards,
> >>> > >>>>>>>>> Edwin
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch>
> wrote:
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I
> >>> > assume
> >>> > >>>>>>>> nothing
> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> ??
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:
> >>> > >> solr-user@lucene.apache.org
> >>> > >>>>>>>>>
> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect
> >>> multiple
> >>> > >> \n
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Hi,
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
> >>> remove
> >>> > more
> >>> > >>>>>>>> than
> >>> > >>>>>>>>> two
> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n
> \n,
> >>> \n
> >>> > \n
> >>> > >>>>>>>> \n
> >>> > >>>>>>>>> \n),
> >>> > >>>>>>>>>> and replace it with two <br>.
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> I use the following regex pattern and it is working when I
> >>> test
> >>> > it
> >>> > >>>>>>>> in
> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside
> >>> the
> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>> > >>>>>>>>>> </processor>
> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
> >>> instructing
> >>> > the
> >>> > >>>>>>>> regex
> >>> > >>>>>>>>> to
> >>> > >>>>>>>>>> match any \n that have space after and {2,} is instructing
> >>> the
> >>> > >>>>>>>> regex to
> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should I
> do
> >>> it?
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> I am using Solr 7.6.0.
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>> Regards,
> >>> > >>>>>>>>>> Edwin
> >>> > >>>>>>>>>>
> >>> > >>>>>>>>>
> >>> > >>>>>>>>
> >>> > >>>>>>>
> >>> > >>
> >>> >
> >>>
> >>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
Hi Edwin



For the first pattern/replacement, try:



<str name="pattern">[ \t]*\r?\n</str>

<str name="replacement">&lt;br&gt;</str>



Now all line endings and preceding whitespace characters should be changed to ‘<br>’.



The second pattern/replacement should then collapse any run of 3 or more ‘<br>’ into exactly 2 ‘<br>’:



<str name="pattern">(&lt;br&gt;){3,}</str>

<str name="replacement">&lt;br&gt;&lt;br&gt;</str>
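As a rough check of these two steps outside Solr, here is a minimal Java sketch. It assumes each processor applies its pattern with java.util.regex and replaceAll semantics; the class and method names (TwoPassBreaks, normalize) are made up for illustration, and the sample string is transcribed from Example 2 further down in this thread. The second pattern is written as (<br>){3,} so that a run of three or more single <br> is collapsed to two.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: mimics two chained regex-replace steps.
final class TwoPassBreaks {
    // Pass 1: one line ending (LF or CRLF) plus any spaces or tabs before it.
    static final Pattern LINE_END = Pattern.compile("[ \\t]*\\r?\\n");
    // Pass 2: a run of three or more consecutive <br> tags.
    static final Pattern BR_RUN = Pattern.compile("(<br>){3,}");

    static String normalize(String content) {
        // Every line ending becomes exactly one <br>.
        String withBreaks = LINE_END.matcher(content).replaceAll("<br>");
        // Runs of 3+ <br> shrink to two; single and double <br> stay untouched.
        return BR_RUN.matcher(withBreaks).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String example2 =
            "exalted  \n \n\n   Psalm 89:17   \n\n  \n\n  3 Choa Chu Kang Avenue 4, Singapore";
        System.out.println(normalize(example2));
        System.out.println(normalize("line one\nline two"));
    }
}
```

With this input the blank-line runs in Example 2 each end up as a single <br><br>, and a lone \n becomes a single <br>. Spaces that follow the last line ending of a run (for example before "Psalm") are left in place, because the first pattern only consumes whitespace that comes before a line ending.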



Hope this approach works. Sorry for not replying earlier and best regards,

Paul








From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Tuesday, 5 March 2019 03:35
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi,

For your info, this issue is occurring in the new Solr 7.7.1 as well.

Regards,
Edwin

On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> Does anyone else have other suggestions, or has anyone faced the same problem?
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> If I tried to execute the second step first, then I will only get a
>> single <br> for those with 2 <br>.
>> For those that we originally get 4 <br>, there will be 2 <br> with a
>> space in between.
>>
>> This is just changing the 2 <br> to be a single <br>, since the second
>> step is to replace with a single <br>.
>> But it has not solved the underlying problem yet.
>>
>> Regards,
>> Edwin
>>
>>
>> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>
>>> If the second step is executed first, then you will get the unwanted 4
>>> <br>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> Sent: Wednesday, 20 February 2019 09:29
>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi Jörn ,
>>>
>>> Do you mean the regex is not correct?
>>>
>>> We are already using two RegexReplaceProcessorFactory steps, like the one
>>> shown below. The output that we get is still the same.
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>      <str name="fieldName">content</str>
>>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>      <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>      <str name="fieldName">content</str>
>>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>>      <str name="replacement">&lt;br&gt;</str>
>>>      <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:
>>>
>>> > Then you need two regexprocessfactory steps
>>> >
>>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > Thanks for the reply.
>>> > >
>>> > > Do you know of any regex online tool that works correctly for Java
>>> regex?
>>> > > I tried to find some, but they are not working properly.
>>> > >
>>> > > Yes, our plan is to replace more than one \n with <br><br>, and
>>> single \n
>>> > > with single <br>.
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com>
>>> wrote:
>>> > >>
>>> > >> Solr uses Java regex matching, so i doubt there is a bug - it would
>>> then
>>> > >> be in the JDK. Try out in a regex online Tool that supports Java
>>> regex
>>> > for
>>> > >> your solution.
>>> > >>
>>> > >> I believe you want to have 2 regex process factories:
>>> > >> One that deals with single \n and one that deals with more than one
>>> \n
>>> > >>
>>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> > >>>
>>> > >>> Hi,
>>> > >>>
>>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> > >>> configuration:
>>> > >>>
>>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>  <str name="fieldName">content</str>
>>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>  <bool name="literalReplacement">true</bool>
>>> > >>> </processor>
>>> > >>>
>>> > >>> However, the issue is still occurring.
>>> > >>>
>>> > >>> Anyone else is able to help?
>>> > >>>
>>> > >>> Regards,
>>> > >>> Edwin
>>> > >>>
>>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>>> > edwinyeozl@gmail.com>
>>> > >>> wrote:
>>> > >>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>> > >>>>
>>> > >>>> Regards,
>>> > >>>> Edwin
>>> > >>>>
>>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>>> > edwinyeozl@gmail.com
>>> > >>>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>>> Hi,
>>> > >>>>>
>>> > >>>>> Should we report this as a bug in Solr?
>>> > >>>>>
>>> > >>>>> Regards,
>>> > >>>>> Edwin
>>> > >>>>>
>>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>>> > edwinyeozl@gmail.com
>>> > >>>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>>> Hi Paul,
>>> > >>>>>>
>>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try
>>> it on
>>> > >>>>>> https://regex101.com/, it is able to give us the correct
>>> result for
>>> > >> all
>>> > >>>>>> the examples (ie: All of them will only have <br><br>, and not
>>> more
>>> > >> than
>>> > >>>>>> that like what we are getting in Solr in our earlier examples).
>>> > >>>>>>
>>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> > >>>>>>
>>> > >>>>>> Regards,
>>> > >>>>>> Edwin
>>> > >>>>>>
>>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>>> > >> edwinyeozl@gmail.com>
>>> > >>>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hi Paul,
>>> > >>>>>>>
>>> > >>>>>>> We have tried it with the space preceding the \n i.e. <str
>>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex
>>> pattern:
>>> > >>>>>>>
>>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>> </processor>
>>> > >>>>>>>
>>> > >>>>>>> However, we are also getting the exact same results as the
>>> earlier
>>> > >>>>>>> Example 1, 2 and 3.
>>> > >>>>>>>
>>> > >>>>>>> As for your point 2 on perhaps in the data you have other (non
>>> > >>>>>>> printing) characters than \n, we have found that there are no
>>> non
>>> > >> printing
>>> > >>>>>>> characters. It is just next line with a space. You can refer
>>> to the
>>> > >>>>>>> original content in the same examples below.
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> Example 1: The sentence that the above regex pattern is working
>>> > >>>>>>> correctly
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>> Dear Sir,
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> I am terminating
>>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> > >>>>>>>
>>> > >>>>>>> Example 2: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>>
>>> > >>>>>>> *exalted*
>>> > >>>>>>>
>>> > >>>>>>> *Psalm 89:17*
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>>  \n\n  3
>>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> <br><br>3
>>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>
>>> > >>>>>>> Example 3: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>>
>>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>>> >  \n\n
>>> > >> \n
>>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> Tue,
>>> > >> Dec 18,
>>> > >>>>>>> 2018 at 10:07 AM
>>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>>> > >>>>>>>
>>> > >>>>>>> Thank you.
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> Edwin
>>> > >>>>>>>
>>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>> Hi Edwin
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede
>>> the \n
>>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
>>> characters
>>> > >>>>>>>> than \n?
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:
>>> > solr-user@lucene.apache.org>
>>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
>>> > >> multiple \n
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> Hi Paul,
>>> > >>>>>>>>
>>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>>> </processor>
>>> > >>>>>>>>
>>> > >>>>>>>> But we still have exactly the same problem of Example 1,2 and
>>> 3
>>> > >> below.
>>> > >>>>>>>>
>>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
>>> working
>>> > >>>>>>>> correctly
>>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> terminating
>>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> > >>>>>>>>
>>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>>  \n\n
>>> > 3
>>> > >>>>>>>> Choa
>>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> > <br><br>3
>>> > >>>>>>>> Choa
>>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>
>>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>>> >  \n\n
>>> > >>>>>>>> \n \n\n
>>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> Tue, Dec
>>> > >> 18,
>>> > >>>>>>>> 2018
>>> > >>>>>>>> at 10:07 AM
>>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> > >>>>>>>> <br><br>On
>>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>
>>> > >>>>>>>> Any further suggestion?
>>> > >>>>>>>>
>>> > >>>>>>>> Thank you.
>>> > >>>>>>>>
>>> > >>>>>>>> Regards,
>>> > >>>>>>>> Edwin
>>> > >>>>>>>>
>>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing
>>> on
>>> > the
>>> > >>>>>>>> {2,}
>>> > >>>>>>>>> part you could try
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> If you also want to match CRLF then
>>> > >>>>>>>>>
>>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:
>>> > solr-user@lucene.apache.org
>>> > >>>
>>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
>>> > >> multiple
>>> > >>>>>>>> \n
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> Hi Paul,
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks for your reply.
>>> > >>>>>>>>>
>>> > >>>>>>>>> When I use this pattern:
>>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>>>> </processor>
>>> > >>>>>>>>>
>>> > >>>>>>>>> It is working for some sentence within the same content and
>>> not
>>> > >>>>>>>> working for
>>> > >>>>>>>>> some sentences. Please see below for the one that is working
>>> and
>>> > >>>>>>>> another
>>> > >>>>>>>>> that is not working (partially working):
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
>>> working
>>> > >>>>>>>> correctly
>>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> terminating
>>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >  \n\n  3
>>> > >>>>>>>> Choa
>>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> > <br><br>3
>>> > >>>>>>>> Choa
>>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>>> > >> \n\n
>>> > >>>>>>>> \n
>>> > >>>>>>>>> \n\n
>>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> Tue,
>>> > Dec
>>> > >>>>>>>> 18, 2018
>>> > >>>>>>>>> at 10:07 AM
>>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>  <br><br>
>>> > >>>>>>>> <br><br>On
>>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>>
>>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thank you.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Regards,
>>> > >>>>>>>>> Edwin
>>> > >>>>>>>>>
>>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I
>>> > assume
>>> > >>>>>>>> nothing
>>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> ??
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:
>>> > >> solr-user@lucene.apache.org
>>> > >>>>>>>>>
>>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect
>>> multiple
>>> > >> \n
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Hi,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
>>> remove
>>> > more
>>> > >>>>>>>> than
>>> > >>>>>>>>> two
>>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n,
>>> \n
>>> > \n
>>> > >>>>>>>> \n
>>> > >>>>>>>>> \n),
>>> > >>>>>>>>>> and replace it with two <br>.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I use the following regex pattern and it is working when I
>>> test
>>> > it
>>> > >>>>>>>> in
>>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside
>>> the
>>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>>>>> </processor>
>>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
>>> instructing
>>> > the
>>> > >>>>>>>> regex
>>> > >>>>>>>>> to
>>> > >>>>>>>>>> match any \n that have space after and {2,} is instructing
>>> the
>>> > >>>>>>>> regex to
>>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Please kindly let me know what is wrong and how should I do
>>> it?
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Regards,
>>> > >>>>>>>>>> Edwin
>>> > >>>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>
>>> > >>
>>> >
>>>
>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

For your info, this issue is occurring in the new Solr 7.7.1 as well.

Regards,
Edwin

On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> Does anyone else have other suggestions, or has anyone faced the same problem?
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> If I tried to execute the second step first, then I will only get a
>> single <br> for those with 2 <br>.
>> For those that we originally get 4 <br>, there will be 2 <br> with a
>> space in between.
>>
>> This is just changing the 2 <br> to be a single <br>, since the second
>> step is to replace with a single <br>.
>> But it has not solved the underlying problem yet.
>>
>> Regards,
>> Edwin
>>
>>
>> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>>
>>> If the second step is executed first, then you will get the unwanted 4
>>> <br>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> Sent: Wednesday, 20 February 2019 09:29
>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi Jörn ,
>>>
>>> Do you mean the regex is not correct?
>>>
>>> We are already using two RegexReplaceProcessorFactory steps, like the one
>>> shown below. The output that we get is still the same.
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>      <str name="fieldName">content</str>
>>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>      <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>      <str name="fieldName">content</str>
>>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>>      <str name="replacement">&lt;br&gt;</str>
>>>      <bool name="literalReplacement">true</bool>
>>> </processor>
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:
>>>
>>> > Then you need two regexprocessfactory steps
>>> >
>>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > Thanks for the reply.
>>> > >
>>> > > Do you know of any regex online tool that works correctly for Java
>>> regex?
>>> > > I tried to find some, but they are not working properly.
>>> > >
>>> > > Yes, our plan is to replace more than one \n with <br><br>, and
>>> single \n
>>> > > with single <br>.
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com>
>>> wrote:
>>> > >>
>>> > >> Solr uses Java regex matching, so i doubt there is a bug - it would
>>> then
>>> > >> be in the JDK. Try out in a regex online Tool that supports Java
>>> regex
>>> > for
>>> > >> your solution.
>>> > >>
>>> > >> I believe you want to have 2 regex process factories:
>>> > >> One that deals with single \n and one that deals with more than one
>>> \n
>>> > >>
>>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> > >>>
>>> > >>> Hi,
>>> > >>>
>>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> > >>> configuration:
>>> > >>>
>>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>  <str name="fieldName">content</str>
>>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>  <bool name="literalReplacement">true</bool>
>>> > >>> </processor>
>>> > >>>
>>> > >>> However, the issue is still occurring.
>>> > >>>
>>> > >>> Is anyone else able to help?
>>> > >>>
>>> > >>> Regards,
>>> > >>> Edwin
>>> > >>>
>>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>>> > edwinyeozl@gmail.com>
>>> > >>> wrote:
>>> > >>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>> > >>>>
>>> > >>>> Regards,
>>> > >>>> Edwin
>>> > >>>>
>>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>>> > edwinyeozl@gmail.com
>>> > >>>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>>> Hi,
>>> > >>>>>
>>> > >>>>> Should we report this as a bug in Solr?
>>> > >>>>>
>>> > >>>>> Regards,
>>> > >>>>> Edwin
>>> > >>>>>
>>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>>> > edwinyeozl@gmail.com
>>> > >>>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>>> Hi Paul,
>>> > >>>>>>
>>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try
>>> it on
>>> > >>>>>> https://regex101.com/, it is able to give us the correct
>>> result for
>>> > >> all
>>> > >>>>>> the examples (ie: All of them will only have <br><br>, and not
>>> more
>>> > >> than
>>> > >>>>>> that like what we are getting in Solr in our earlier examples).
>>> > >>>>>>
>>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> > >>>>>>
>>> > >>>>>> Regards,
>>> > >>>>>> Edwin
>>> > >>>>>>
>>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>>> > >> edwinyeozl@gmail.com>
>>> > >>>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hi Paul,
>>> > >>>>>>>
>>> > >>>>>>> We have tried it with the space preceding the \n i.e. <str
>>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex
>>> pattern:
>>> > >>>>>>>
>>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>> </processor>
>>> > >>>>>>>
>>> > >>>>>>> However, we are also getting the exact same results as the
>>> earlier
>>> > >>>>>>> Example 1, 2 and 3.
>>> > >>>>>>>
>>> > >>>>>>> As for your point 2 on perhaps in the data you have other (non
>>> > >>>>>>> printing) characters than \n, we have found that there are no
>>> non
>>> > >> printing
>>> > >>>>>>> characters. It is just next line with a space. You can refer
>>> to the
>>> > >>>>>>> original content in the same examples below.
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> Example 1: The sentence that the above regex pattern is working
>>> > >>>>>>> correctly
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>> Dear Sir,
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> I am terminating
>>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> > >>>>>>>
>>> > >>>>>>> Example 2: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>>
>>> > >>>>>>> *exalted*
>>> > >>>>>>>
>>> > >>>>>>> *Psalm 89:17*
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>>  \n\n  3
>>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> <br><br>3
>>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>>> > >>>>>>>
>>> > >>>>>>> Example 3: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>> *Original content in EML file:*
>>> > >>>>>>>
>>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>>> >  \n\n
>>> > >> \n
>>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> Tue,
>>> > >> Dec 18,
>>> > >>>>>>> 2018 at 10:07 AM
>>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>>> > >>>>>>>
>>> > >>>>>>> Thank you.
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> Edwin
>>> > >>>>>>>
>>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>> Hi Edwin
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede
>>> the \n
>>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
>>> characters
>>> > >>>>>>>> than \n?
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > >>>>>>>> Windows 10
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> Hi Paul,
>>> > >>>>>>>>
>>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>>> </processor>
>>> > >>>>>>>>
>>> > >>>>>>>> But we still have exactly the same problem of Example 1,2 and
>>> 3
>>> > >> below.
>>> > >>>>>>>>
>>> > >>>>>>>> Example 1: The sentence that the above regex pattern is
>>> working
>>> > >>>>>>>> correctly
>>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> terminating
>>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> > >>>>>>>>
>>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>>  \n\n
>>> > 3
>>> > >>>>>>>> Choa
>>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> > <br><br>3
>>> > >>>>>>>> Choa
>>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>
>>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>>> >  \n\n
>>> > >>>>>>>> \n \n\n
>>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> Tue, Dec
>>> > >> 18,
>>> > >>>>>>>> 2018
>>> > >>>>>>>> at 10:07 AM
>>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> > >>>>>>>> <br><br>On
>>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>
>>> > >>>>>>>> Any further suggestion?
>>> > >>>>>>>>
>>> > >>>>>>>> Thank you.
>>> > >>>>>>>>
>>> > >>>>>>>> Regards,
>>> > >>>>>>>> Edwin
>>> > >>>>>>>>
>>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing
>>> on
>>> > the
>>> > >>>>>>>> {2,}
>>> > >>>>>>>>> part you could try
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> If you also want to match CRLF then
>>> > >>>>>>>>>
>>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > >>>>>>>>> Windows 10
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> Hi Paul,
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks for your reply.
>>> > >>>>>>>>>
>>> > >>>>>>>>> When I use this pattern:
>>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>>>> </processor>
>>> > >>>>>>>>>
>>> > >>>>>>>>> It is working for some sentence within the same content and
>>> not
>>> > >>>>>>>> working for
>>> > >>>>>>>>> some sentences. Please see below for the one that is working
>>> and
>>> > >>>>>>>> another
>>> > >>>>>>>>> that is not working (partially working):
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
>>> working
>>> > >>>>>>>> correctly
>>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>>> terminating
>>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>> >  \n\n  3
>>> > >>>>>>>> Choa
>>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>>> > <br><br>3
>>> > >>>>>>>> Choa
>>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>>> > >>>>>>>>>
>>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
>>> partially
>>> > >>>>>>>> working
>>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>>> > >> \n\n
>>> > >>>>>>>> \n
>>> > >>>>>>>>> \n\n
>>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>>> Tue,
>>> > Dec
>>> > >>>>>>>> 18, 2018
>>> > >>>>>>>>> at 10:07 AM
>>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/
>>>  <br><br>
>>> > >>>>>>>> <br><br>On
>>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>> > >>>>>>>>>
>>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thank you.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Regards,
>>> > >>>>>>>>> Edwin
>>> > >>>>>>>>>
>>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I
>>> > assume
>>> > >>>>>>>> nothing
>>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> ??
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> > >>>>>>>>>> Windows 10
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Hi,
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to
>>> remove
>>> > more
>>> > >>>>>>>> than
>>> > >>>>>>>>> two
>>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n,
>>> \n
>>> > \n
>>> > >>>>>>>> \n
>>> > >>>>>>>>> \n),
>>> > >>>>>>>>>> and replace it with two <br>.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I use the following regex pattern and it is working when I
>>> test
>>> > it
>>> > >>>>>>>> in
>>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside
>>> the
>>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> > >>>>>>>>>> </processor>
>>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> To explain further about my regex pattern, \s* is
>>> instructing
>>> > the
>>> > >>>>>>>> regex
>>> > >>>>>>>>> to
>>> > >>>>>>>>>> match any \n that have space after and {2,} is instructing
>>> the
>>> > >>>>>>>> regex to
>>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Please kindly let me know what is wrong and how should I do
>>> it?
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Regards,
>>> > >>>>>>>>>> Edwin
>>> > >>>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>
>>> > >>
>>> >
>>>
>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

Does anyone else have other suggestions, or has anyone faced the same problem?

Regards,
Edwin
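
Since Solr relies on Java regex matching, one way to check a pattern outside
Solr is a small standalone Java program instead of an online tester. The
sketch below is not from the thread: the class name NewlineRegexCheck and the
method collapse are made up for illustration. The second input assumes, purely
as a hypothesis, that the text contains a non-breaking space (U+00A0), which
Java's default \s does not match. Under that assumption the output has the
same shape as Example 2, but the thread does not confirm that this is the
actual cause.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewlineRegexCheck {
    // Same pattern text as in the <str name="pattern"> element,
    // written here as a Java string literal (hence the doubled backslashes).
    static final Pattern BLANK_RUN = Pattern.compile("(\\n\\s*){2,}");

    // Replaces every run of two or more newlines (with whitespace after each)
    // by a literal "<br><br>".
    static String collapse(String content) {
        return BLANK_RUN.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 from the thread, typed with ordinary ASCII spaces only.
        String ascii =
            "exalted  \n \n\n   Psalm 89:17   \n\n  \n\n  3 Choa Chu Kang Avenue 4";

        // Hypothetical variant: one space in the second run is U+00A0.
        // This is an assumption for illustration, not data from the thread.
        String nbsp =
            "exalted  \n \n\n   Psalm 89:17   \n\n \u00A0\n\n  3 Choa Chu Kang Avenue 4";

        // prints: exalted  <br><br>Psalm 89:17   <br><br>3 Choa Chu Kang Avenue 4
        System.out.println(collapse(ascii));

        // The run is split in two, with the U+00A0 left between the pairs:
        // exalted  <br><br>Psalm 89:17   <br><br>(U+00A0)<br><br>3 Choa ...
        System.out.println(collapse(nbsp));
    }
}
```

If such a character does turn out to be in the data, widening the character
class, for example to [\s\u00A0], would be one option to try.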

On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Paul,
>
> If I execute the second step first, I only get a single <br> in the cases
> that previously produced 2 <br>.
> In the cases that originally produced 4 <br>, there are now 2 <br> with a
> space in between.
>
> This only changes each pair of <br> into a single <br>, since the second
> step replaces with a single <br>.
> It has not solved the underlying problem.
>
> Regards,
> Edwin
>
>
> On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:
>
>> If the second step is executed first, then you will get the unwanted 4
>> <br>
>>
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> Sent: Wednesday, 20 February 2019 09:29
>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>>
>>
>> Hi Jörn ,
>>
>> Do you mean the regex is not correct?
>>
>> We are already using two RegexReplaceProcessorFactory steps, like the one
>> shown below. The output that we get is still the same.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>      <str name="fieldName">content</str>
>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>      <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>      <str name="fieldName">content</str>
>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>      <str name="replacement">&lt;br&gt;</str>
>>      <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:
>>
>> > Then you need two regexprocessfactory steps
>> >
>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > Thanks for the reply.
>> > >
>> > > Do you know of any regex online tool that works correctly for Java
>> regex?
>> > > I tried to find some, but they are not working properly.
>> > >
>> > > Yes, our plan is to replace more than one \n with <br><br>, and
>> single \n
>> > > with single <br>.
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com>
>> wrote:
>> > >>
>> > >> Solr uses Java regex matching, so i doubt there is a bug - it would
>> then
>> > >> be in the JDK. Try out in a regex online Tool that supports Java
>> regex
>> > for
>> > >> your solution.
>> > >>
>> > >> I believe you want to have 2 regex process factories:
>> > >> One that deals with single \n and one that deals with more than one
>> \n
>> > >>
>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>> > >>>
>> > >>> Hi,
>> > >>>
>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>> > >>> configuration:
>> > >>>
>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>  <str name="fieldName">content</str>
>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>  <bool name="literalReplacement">true</bool>
>> > >>> </processor>
>> > >>>
>> > >>> However, the issue is still occurring.
>> > >>>
>> > >>> Is anyone else able to help?
>> > >>>
>> > >>> Regards,
>> > >>> Edwin
>> > >>>
>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
>> > edwinyeozl@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> > >>>>
>> > >>>> Regards,
>> > >>>> Edwin
>> > >>>>
>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
>> > edwinyeozl@gmail.com
>> > >>>
>> > >>>> wrote:
>> > >>>>
>> > >>>>> Hi,
>> > >>>>>
>> > >>>>> Should we report this as a bug in Solr?
>> > >>>>>
>> > >>>>> Regards,
>> > >>>>> Edwin
>> > >>>>>
>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
>> > edwinyeozl@gmail.com
>> > >>>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>>> Hi Paul,
>> > >>>>>>
>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try
>> it on
>> > >>>>>> https://regex101.com/, it is able to give us the correct result
>> for
>> > >> all
>> > >>>>>> the examples (ie: All of them will only have <br><br>, and not
>> more
>> > >> than
>> > >>>>>> that like what we are getting in Solr in our earlier examples).
>> > >>>>>>
>> > >>>>>> Could there be a possibility of a bug in Solr?
>> > >>>>>>
>> > >>>>>> Regards,
>> > >>>>>> Edwin
>> > >>>>>>
>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
>> > >> edwinyeozl@gmail.com>
>> > >>>>>> wrote:
>> > >>>>>>
>> > >>>>>>> Hi Paul,
>> > >>>>>>>
>> > >>>>>>> We have tried it with the space preceding the \n i.e. <str
>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex
>> pattern:
>> > >>>>>>>
>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>> </processor>
>> > >>>>>>>
>> > >>>>>>> However, we are also getting the exact same results as the
>> earlier
>> > >>>>>>> Example 1, 2 and 3.
>> > >>>>>>>
>> > >>>>>>> As for your point 2 on perhaps in the data you have other (non
>> > >>>>>>> printing) characters than \n, we have found that there are no non
>> > >> printing
>> > >>>>>>> characters. It is just next line with a space. You can refer to
>> the
>> > >>>>>>> original content in the same examples below.
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> Example 1: The sentence that the above regex pattern is working
>> > >>>>>>> correctly
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>> Dear Sir,
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> I am terminating
>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> > >>>>>>>
>> > >>>>>>> Example 2: The sentence that the above regex pattern is
>> partially
>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>>
>> > >>>>>>> *exalted*
>> > >>>>>>>
>> > >>>>>>> *Psalm 89:17*
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>  \n\n  3
>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> <br><br>3
>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
>> > >>>>>>>
>> > >>>>>>> Example 3: The sentence that the above regex pattern is
>> partially
>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>> *Original content in EML file:*
>> > >>>>>>>
>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>> >  \n\n
>> > >> \n
>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
>> Tue,
>> > >> Dec 18,
>> > >>>>>>> 2018 at 10:07 AM
>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>> > >>>>>>>
>> > >>>>>>> Thank you.
>> > >>>>>>>
>> > >>>>>>> Regards,
>> > >>>>>>> Edwin
>> > >>>>>>>
>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi Edwin
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the
>> \n
>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>> > >>>>>>>> 2.  Perhaps in the data you have other (non printing)
>> characters
>> > >>>>>>>> than \n?
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > >>>>>>>> Windows 10
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hi Paul,
>> > >>>>>>>>
>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>>> </processor>
>> > >>>>>>>>
>> > >>>>>>>> But we still have exactly the same problem of Example 1,2 and 3
>> > >> below.
>> > >>>>>>>>
>> > >>>>>>>> Example 1: The sentence that the above regex pattern is working
>> > >>>>>>>> correctly
>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> > >>>>>>>>
>> > >>>>>>>> Example 2: The sentence that the above regex pattern is
>> partially
>> > >>>>>>>> working
>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>>  \n\n
>> > 3
>> > >>>>>>>> Choa
>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> > <br><br>3
>> > >>>>>>>> Choa
>> > >>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>>>>>>>
>> > >>>>>>>> Example 3: The sentence that the above regex pattern is
>> partially
>> > >>>>>>>> working
>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>> >  \n\n
>> > >>>>>>>> \n \n\n
>> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
>> Dec
>> > >> 18,
>> > >>>>>>>> 2018
>> > >>>>>>>> at 10:07 AM
>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> > >>>>>>>> <br><br>On
>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>
>> > >>>>>>>> Any further suggestion?
>> > >>>>>>>>
>> > >>>>>>>> Thank you.
>> > >>>>>>>>
>> > >>>>>>>> Regards,
>> > >>>>>>>> Edwin
>> > >>>>>>>>
>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on
>> > the
>> > >>>>>>>> {2,}
>> > >>>>>>>>> part you could try
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> If you also want to match CRLF then
>> > >>>>>>>>>
>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > >>>>>>>>> Windows 10
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> Hi Paul,
>> > >>>>>>>>>
>> > >>>>>>>>> Thanks for your reply.
>> > >>>>>>>>>
>> > >>>>>>>>> When I use this pattern:
>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>>>> </processor>
>> > >>>>>>>>>
>> > >>>>>>>>> It is working for some sentence within the same content and
>> not
>> > >>>>>>>> working for
>> > >>>>>>>>> some sentences. Please see below for the one that is working
>> and
>> > >>>>>>>> another
>> > >>>>>>>>> that is not working (partially working):
>> > >>>>>>>>>
>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is
>> working
>> > >>>>>>>> correctly
>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am
>> terminating
>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>> > >>>>>>>>>
>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
>> partially
>> > >>>>>>>> working
>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>> >  \n\n  3
>> > >>>>>>>> Choa
>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
>> > <br><br>3
>> > >>>>>>>> Choa
>> > >>>>>>>>> Chu Kang Avenue 4, Singapore
>> > >>>>>>>>>
>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
>> partially
>> > >>>>>>>> working
>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>> > >> \n\n
>> > >>>>>>>> \n
>> > >>>>>>>>> \n\n
>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
>> > Dec
>> > >>>>>>>> 18, 2018
>> > >>>>>>>>> at 10:07 AM
>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>> > >>>>>>>> <br><br>On
>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>> > >>>>>>>>>
>> > >>>>>>>>> We would appreciate your help to see what is wrong?
>> > >>>>>>>>>
>> > >>>>>>>>> Thank you.
>> > >>>>>>>>>
>> > >>>>>>>>> Regards,
>> > >>>>>>>>> Edwin
>> > >>>>>>>>>
>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>> > >>>>>>>>>>
>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I
>> > assume
>> > >>>>>>>> nothing
>> > >>>>>>>>>> is replaced? Perhaps the pattern should be
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> ??
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>> > >>>>>>>>>> Windows 10
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Hi,
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove
>> > more
>> > >>>>>>>> than
>> > >>>>>>>>> two
>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n,
>> \n
>> > \n
>> > >>>>>>>> \n
>> > >>>>>>>>> \n),
>> > >>>>>>>>>> and replace it with two <br>.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I use the following regex pattern and it is working when I
>> test
>> > it
>> > >>>>>>>> in
>> > >>>>>>>>>> regex101.com. But it is not working when I put it inside the
>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>> > >>>>>>>>>>
>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>> > >>>>>>>>>>  <str name="fieldName">content</str>
>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>> > >>>>>>>>>> </processor>
>> > >>>>>>>>>>         </updateRequestProcessorChain>
>> > >>>>>>>>>>
>> > >>>>>>>>>> To explain further about my regex pattern, \s* is instructing
>> > the
>> > >>>>>>>> regex
>> > >>>>>>>>> to
>> > >>>>>>>>>> match any \n that have space after and {2,} is instructing
>> the
>> > >>>>>>>> regex to
>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
>> > >>>>>>>>>>
>> > >>>>>>>>>> Please kindly let me know what is wrong and how should I do
>> it?
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am using Solr 7.6.0.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Regards,
>> > >>>>>>>>>> Edwin
>> > >>>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>
>> > >>
>> >
>>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

If I execute the second step first, I only get a single <br> in the cases
that previously produced 2 <br>.
In the cases that originally produced 4 <br>, there are now 2 <br> with a
space in between.

This only changes each pair of <br> into a single <br>, since the second
step replaces with a single <br>.
It has not solved the underlying problem.

Regards,
Edwin
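
The effect of the step order described above can be reproduced outside Solr
with plain java.util.regex. This is only a sketch: the class and method names
are invented for illustration, the sample text is made up, and using
Matcher.quoteReplacement to stand in for literalReplacement=true is an
assumption about how Solr treats the replacement string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepNewlineReplace {
    // The two patterns from the processor configuration in this thread.
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final Pattern SINGLE = Pattern.compile("([ \\t]*\\r?\\n){1,}");

    // Runs of two or more newlines first, then whatever single newlines remain.
    static String multiThenSingle(String s) {
        String afterMulti =
            MULTI.matcher(s).replaceAll(Matcher.quoteReplacement("<br><br>"));
        return SINGLE.matcher(afterMulti)
                .replaceAll(Matcher.quoteReplacement("<br>"));
    }

    // Reversed order: the single-newline step consumes every run, so the
    // multi-newline step has nothing left to match.
    static String singleThenMulti(String s) {
        String afterSingle =
            SINGLE.matcher(s).replaceAll(Matcher.quoteReplacement("<br>"));
        return MULTI.matcher(afterSingle)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Made-up sample: one single newline, then a run of three newlines.
        String sample = "line one\nline two \n \n\nline three";

        // prints: line one<br>line two<br><br>line three
        System.out.println(multiThenSingle(sample));

        // prints: line one<br>line two<br>line three
        System.out.println(singleThenMulti(sample));
    }
}
```

With the single-newline step first, every run collapses to one <br>, which is
consistent with the result reported in this message.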


On Wed, 20 Feb 2019 at 16:41, <pa...@ub.unibe.ch> wrote:

> If the second step is executed first, then you will get the unwanted 4 <br>
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>
> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> Sent: Wednesday, 20 February 2019 09:29
> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi Jörn ,
>
> Do you mean the regex is not correct?
>
> We are already using two RegexReplaceProcessorFactory steps, like the one
> shown below. The output that we get is still the same.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>      <str name="fieldName">content</str>
>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>      <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>      <str name="fieldName">content</str>
>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>      <str name="replacement">&lt;br&gt;</str>
>      <bool name="literalReplacement">true</bool>
> <processor>
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:
>
> > Then you need two regexprocessfactory steps
> >
> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo <
> edwinyeozl@gmail.com
> > >:
> > >
> > > Hi,
> > >
> > > Thanks for the reply.
> > >
> > > Do you know of any regex online tool that works correctly for Java
> regex?
> > > I tried to find some, but they are not working properly.
> > >
> > > Yes, our plan is to replace more than one \n with <br><br>, and single
> \n
> > > with single <br>.
> > >
> > > Regards,
> > > Edwin
> > >
> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com>
> wrote:
> > >>
> > >> Solr uses Java regex matching, so i doubt there is a bug - it would
> then
> > >> be in the JDK. Try out in a regex online Tool that supports Java regex
> > for
> > >> your solution.
> > >>
> > >> I believe you want to have 2 regex process factories:
> > >> One that deals with single \n and one that deals with more than one \n
> > >>
> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <
> > edwinyeozl@gmail.com
> > >>> :
> > >>>
> > >>> Hi,
> > >>>
> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > >>> configuration:
> > >>>
> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>  <str name="fieldName">content</str>
> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>  <bool name="literalReplacement">true</bool>
> > >>> </processor>
> > >>>
> > >>> However, the issue is still occurring.
> > >>>
> > >>> Anyone else is able to help?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> > edwinyeozl@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> > edwinyeozl@gmail.com
> > >>>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Should we report this as a bug in Solr?
> > >>>>>
> > >>>>> Regards,
> > >>>>> Edwin
> > >>>>>
> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <
> > edwinyeozl@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Paul,
> > >>>>>>
> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try in
> on
> > >>>>>> https://regex101.com/, it is able to give us the correct result
> for
> > >> all
> > >>>>>> the examples (ie: All of them will only have <br><br>, and not
> more
> > >> than
> > >>>>>> that like what we are getting in Solr in our earlier examples).
> > >>>>>>
> > >>>>>> Could there be a possibility of a bug in Solr?
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Edwin
> > >>>>>>
> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> > >> edwinyeozl@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi Paul,
> > >>>>>>>
> > >>>>>>> We have tried it with the space preceeding the \n i.e. <str
> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex
> pattern:
> > >>>>>>>
> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>> </processor>
> > >>>>>>>
> > >>>>>>> However, we are also getting the exact same results as the
> earlier
> > >>>>>>> Example 1, 2 and 3.
> > >>>>>>>
> > >>>>>>> As for your point 2 on perhaps in the data you have other (non
> > >>>>>>> printing) characters than \n, we have find that there are no non
> > >> printing
> > >>>>>>> characters. It is just next line with a space. You can refer to
> the
> > >>>>>>> original content in the same examples below.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>>>> correctly
> > >>>>>>> *Original content in EML file:*
> > >>>>>>> Dear Sir,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I am terminating
> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>
> > >>>>>>> Example 2: The sentence that the above regex pattern is partially
> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content in EML file:*
> > >>>>>>>
> > >>>>>>> *exalted*
> > >>>>>>>
> > >>>>>>> *Psalm 89:17*
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> 3 Choa Chu Kang Avenue 4
> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>  \n\n  3
> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> <br><br>3
> > >>>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>
> > >>>>>>> Example 3: The sentence that the above regex pattern is partially
> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content in EML file:*
> > >>>>>>>
> > >>>>>>> http://www.concordpri.moe.edu.sg/
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
> >  \n\n
> > >> \n
> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On
> Tue,
> > >> Dec 18,
> > >>>>>>> 2018 at 10:07 AM
> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
> > >>>>>>>
> > >>>>>>> Thank you.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Edwin
> > >>>>>>>
> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Edwin
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> 1.  Sorry, the pattern was wrong, the space should preceed the
> \n
> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>>> 2.  Perhaps in the data you have other (non printing) characters
> > >>>>>>>> than \n?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Gesendet von Mail<
> https://go.microsoft.com/fwlink/?LinkId=550986>
> > >> für
> > >>>>>>>> Windows 10
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Von: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:23
> > >>>>>>>> An: solr-user@lucene.apache.org<mailto:
> > solr-user@lucene.apache.org>
> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect
> > >> multiple \n
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hi Paul,
> > >>>>>>>>
> > >>>>>>>> We have tried this suggested regex pattern as follow:
> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>> </processor>
> > >>>>>>>>
> > >>>>>>>> But we still have exactly the same problem of Example 1,2 and 3
> > >> below.
> > >>>>>>>>
> > >>>>>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>>>>> correctly
> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>>
> > >>>>>>>> Example 2: The sentence that the above regex pattern is
> partially
> > >>>>>>>> working
> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n
> > 3
> > >>>>>>>> Choa
> > >>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> > <br><br>3
> > >>>>>>>> Choa
> > >>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>>
> > >>>>>>>> Example 3: The sentence that the above regex pattern is
> partially
> > >>>>>>>> working
> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
> >  \n\n
> > >>>>>>>> \n \n\n
> > >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> Dec
> > >> 18,
> > >>>>>>>> 2018
> > >>>>>>>> at 10:07 AM
> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>>> <br><br>On
> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>
> > >>>>>>>> Any further suggestion?
> > >>>>>>>>
> > >>>>>>>> Thank you.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Edwin
> > >>>>>>>>
> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
> > >>>>>>>>>
> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on
> > the
> > >>>>>>>> {2,}
> > >>>>>>>>> part you could try
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> If you also want to match CRLF then
> > >>>>>>>>>
> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Gesendet von Mail<
> https://go.microsoft.com/fwlink/?LinkId=550986
> > >
> > >>>>>>>> für
> > >>>>>>>>> Windows 10
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Von: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:10
> > >>>>>>>>> An: solr-user@lucene.apache.org<mailto:
> > solr-user@lucene.apache.org
> > >>>
> > >>>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect
> > >> multiple
> > >>>>>>>> \n
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Hi Paul,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for your reply.
> > >>>>>>>>>
> > >>>>>>>>> When I use this pattern:
> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>>> </processor>
> > >>>>>>>>>
> > >>>>>>>>> It is working for some sentence within the same content and not
> > >>>>>>>> working for
> > >>>>>>>>> some sentences. Please see below for the one that is working
> and
> > >>>>>>>> another
> > >>>>>>>>> that is not working (partially working):
> > >>>>>>>>>
> > >>>>>>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>>>>> correctly
> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>>>
> > >>>>>>>>> Example 2: The sentence that the above regex pattern is
> partially
> > >>>>>>>> working
> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
> >  \n\n  3
> > >>>>>>>> Choa
> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> > <br><br>3
> > >>>>>>>> Choa
> > >>>>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>>>
> > >>>>>>>>> Example 3: The sentence that the above regex pattern is
> partially
> > >>>>>>>> working
> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
> > >> \n\n
> > >>>>>>>> \n
> > >>>>>>>>> \n\n
> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> > Dec
> > >>>>>>>> 18, 2018
> > >>>>>>>>> at 10:07 AM
> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>>>> <br><br>On
> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>>
> > >>>>>>>>> We would appreciate your help to see what is wrong?
> > >>>>>>>>>
> > >>>>>>>>> Thank you.
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Edwin
> > >>>>>>>>>
> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> You don’t say what happens, just that it is not working. I
> > assume
> > >>>>>>>> nothing
> > >>>>>>>>>> is replaced? Perhaps the pattern should be
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> ??
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Gesendet von Mail<
> > https://go.microsoft.com/fwlink/?LinkId=550986>
> > >>>>>>>> für
> > >>>>>>>>>> Windows 10
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Von: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 14:08
> > >>>>>>>>>> An: solr-user@lucene.apache.org<mailto:
> > >> solr-user@lucene.apache.org
> > >>>>>>>>>
> > >>>>>>>>>> Betreff: RegexReplaceProcessorFactory pattern to detect
> multiple
> > >> \n
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove
> > more
> > >>>>>>>> than
> > >>>>>>>>> two
> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n, \n
> > \n
> > >>>>>>>> \n
> > >>>>>>>>> \n),
> > >>>>>>>>>> and replace it with two <br>.
> > >>>>>>>>>>
> > >>>>>>>>>> I use the following regex pattern and it is working when I
> test
> > it
> > >>>>>>>> in
> > >>>>>>>>>> regex101.com. But it is not working when I put it inside the
> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
> > >>>>>>>>>>
> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>>>  <str name="fieldName">content</str>
> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>>>> </processor>
> > >>>>>>>>>>         </updateRequestProcessorChain>
> > >>>>>>>>>>
> > >>>>>>>>>> To explain further about my regex pattern, \s* is instructing
> > the
> > >>>>>>>> regex
> > >>>>>>>>> to
> > >>>>>>>>>> match any \n that have space after and {2,} is instructing the
> > >>>>>>>> regex to
> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n).
> > >>>>>>>>>>
> > >>>>>>>>>> Please kindly let me know what is wrong and how should I do
> it?
> > >>>>>>>>>>
> > >>>>>>>>>> I am using Solr 7.6.0.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Edwin
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>
> >
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
If the second step is executed first, then you will get the unwanted 4 <br>



From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Wednesday, 20 February 2019 09:29
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi Jörn,

Do you mean the regex is not correct?

We are already using two RegexReplaceProcessorFactory steps, like the ones
shown below. The output that we get is still the same.

<processor class="solr.RegexReplaceProcessorFactory">
     <str name="fieldName">content</str>
     <str name="pattern">([ \t]*\r?\n){2,}</str>
     <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
     <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
     <str name="fieldName">content</str>
     <str name="pattern">([ \t]*\r?\n){1,}</str>
     <str name="replacement">&lt;br&gt;</str>
     <bool name="literalReplacement">true</bool>
</processor>

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:

> Then you need two regexprocessfactory steps

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Jörn,

Do you mean the regex is not correct?

We are already using two RegexReplaceProcessorFactory steps, like the ones
shown below. The output that we get is still the same.

<processor class="solr.RegexReplaceProcessorFactory">
     <str name="fieldName">content</str>
     <str name="pattern">([ \t]*\r?\n){2,}</str>
     <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
     <bool name="literalReplacement">true</bool>
</processor>

<processor class="solr.RegexReplaceProcessorFactory">
     <str name="fieldName">content</str>
     <str name="pattern">([ \t]*\r?\n){1,}</str>
     <str name="replacement">&lt;br&gt;</str>
     <bool name="literalReplacement">true</bool>
</processor>
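
One possible way to check these two steps outside Solr is a small
java.util.regex program. The sketch below is only an illustration: the class
name, the sample strings and the broadened pattern are assumptions, and U+00A0
(non-breaking space) is used purely as an example of a character that [ \t]
does not match. Whether such a character is really in our content is not
confirmed.

```java
import java.util.regex.Pattern;

public class NewlineCollapseCheck {
    // Patterns copied from the two processor steps above.
    static final Pattern MULTI = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final Pattern SINGLE = Pattern.compile("([ \\t]*\\r?\\n){1,}");
    // HYPOTHETICAL variant that also accepts U+00A0 before each newline.
    static final Pattern MULTI_NBSP = Pattern.compile("([ \\t\\u00A0]*\\r?\\n){2,}");

    // Multi-newline step first, then the single-newline step.
    static String multiThenSingle(String s) {
        String step1 = MULTI.matcher(s).replaceAll("<br><br>");
        return SINGLE.matcher(step1).replaceAll("<br>");
    }

    // Same first step, but using the broadened (assumed) pattern.
    static String broadened(String s) {
        String step1 = MULTI_NBSP.matcher(s).replaceAll("<br><br>");
        return SINGLE.matcher(step1).replaceAll("<br>");
    }

    public static void main(String[] args) {
        // Only ordinary spaces between the newline groups.
        String plain = "Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang";
        // ASSUMPTION: a non-breaking space hides between the groups.
        String withNbsp = "Psalm 89:17   \n\n \u00A0 \n\n  3 Choa Chu Kang";

        System.out.println(multiThenSingle(plain));    // one <br><br>
        System.out.println(multiThenSingle(withNbsp)); // two <br><br> pairs
        System.out.println(broadened(withNbsp));       // one <br><br>
    }
}
```

If the second line of output shows two <br><br> pairs while the first shows
one, then a character outside [ \t] between the newline groups would be enough
to explain 4 <br> without any bug in Solr or the JDK.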

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jo...@gmail.com> wrote:

> Then you need two regexprocessfactory steps
>
> > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >:
> >
> > Hi,
> >
> > Thanks for the reply.
> >
> > Do you know of any regex online tool that works correctly for Java regex?
> > I tried to find some, but they are not working properly.
> >
> > Yes, our plan is to replace more than one \n with <br><br>, and single \n
> > with single <br>.
> >
> > Regards,
> > Edwin
> >
> >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com> wrote:
> >>
> >> Solr uses Java regex matching, so i doubt there is a bug - it would then
> >> be in the JDK. Try out in a regex online Tool that supports Java regex
> for
> >> your solution.
> >>
> >> I believe you want to have 2 regex process factories:
> >> One that deals with single \n and one that deals with more than one \n
> >>
> >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <
> edwinyeozl@gmail.com
> >>> :
> >>>
> >>> Hi,
> >>>
> >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> >>> configuration:
> >>>
> >>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>  <str name="fieldName">content</str>
> >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
> >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>  <bool name="literalReplacement">true</bool>
> >>> </processor>
> >>>
> >>> However, the issue is still occurring.
> >>>
> >>> Anyone else is able to help?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <
> edwinyeozl@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <
> edwinyeozl@gmail.com
> >>>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Should we report this as a bug in Solr?
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>>>> https://regex101.com/, it gives us the correct result for all the
> >>>>>> examples (i.e. all of them end up with only <br><br>, and not more than
> >>>>>> that, unlike what we are getting in Solr in our earlier examples).
> >>>>>>
> >>>>>> Could there be a possibility of a bug in Solr?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> We have tried it with the space preceding the \n, i.e. <str
> >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
> >>>>>>>
> >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>  <str name="fieldName">content</str>
> >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
> >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>> </processor>
> >>>>>>>
> >>>>>>> However, we are still getting exactly the same results as in the earlier
> >>>>>>> Examples 1, 2 and 3.
> >>>>>>>
> >>>>>>> As for your point 2, that the data may contain other (non-printing)
> >>>>>>> characters than \n, we have found that there are no non-printing
> >>>>>>> characters. It is just a line break with a space. You can refer to the
> >>>>>>> original content in the same examples below.
> >>>>>>>
> >>>>>>>
> >>>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>>> correctly
> >>>>>>> *Original content in EML file:*
> >>>>>>> Dear Sir,
> >>>>>>>
> >>>>>>>
> >>>>>>> I am terminating
> >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>
> >>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> *exalted*
> >>>>>>>
> >>>>>>> *Psalm 89:17*
> >>>>>>>
> >>>>>>>
> >>>>>>> 3 Choa Chu Kang Avenue 4
> >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content in EML file:*
> >>>>>>>
> >>>>>>> http://www.concordpri.moe.edu.sg/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> >>>>>>> Dec 18, 2018 at 10:07 AM
> >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>>
> >>>>>>> Appreciate any other ideas or suggestions that you may have.
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
> >>>>>>>>
> >>>>>>>> Hi Edwin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 1.  Sorry, the pattern was wrong, the space should precede the \n
> >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>>>>>> 2.  Perhaps in the data you have other (non printing) characters
> >>>>>>>> than \n?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> We have tried this suggested regex pattern as follows:
> >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>  <str name="fieldName">content</str>
> >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>> </processor>
> >>>>>>>>
> >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> >>>>>>>> below.
> >>>>>>>>
> >>>>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>>>> correctly
> >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>>
> >>>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>>>> working
> >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>
> >>>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>>>> working
> >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
> >>>>>>>> \n \n\n
> >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> >>>>>>>> 2018 at 10:07 AM
> >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>
> >>>>>>>> Any further suggestion?
> >>>>>>>>
> >>>>>>>> Thank you.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Edwin
> >>>>>>>>
> >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
> >>>>>>>>>
> >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> >>>>>>>>> {2,} part you could try
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you also want to match CRLF then
> >>>>>>>>>
> >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Paul,
> >>>>>>>>>
> >>>>>>>>> Thanks for your reply.
> >>>>>>>>>
> >>>>>>>>> When I use this pattern:
> >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>>  <str name="fieldName">content</str>
> >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
> >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>>> </processor>
> >>>>>>>>>
> >>>>>>>>> It is working for some sentences within the same content and not
> >>>>>>>>> working for others. Please see below for one that is working and
> >>>>>>>>> others that are only partially working:
> >>>>>>>>>
> >>>>>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>>>> correctly
> >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>>>
> >>>>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>>>> working
> >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>>>>>
> >>>>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>>>> working
> >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> >>>>>>>>> \n\n
> >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> >>>>>>>>> 2018 at 10:07 AM
> >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>>>
> >>>>>>>>> We would appreciate your help in finding out what is wrong.
> >>>>>>>>>
> >>>>>>>>> Thank you.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Edwin
> >>>>>>>>>
> >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
> >>>>>>>>>>
> >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
> >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ??
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove two or
> >>>>>>>>>> more \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n
> >>>>>>>>>> \n \n), and replace them with two <br>.
> >>>>>>>>>>
> >>>>>>>>>> I use the following regex pattern and it is working when I test it in
> >>>>>>>>>> regex101.com. But it is not working when I put it inside the
> >>>>>>>>>> RegexReplaceProcessorFactory as below:
> >>>>>>>>>>
> >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>>>  <str name="fieldName">content</str>
> >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
> >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>>>> </processor>
> >>>>>>>>>>         </updateRequestProcessorChain>
> >>>>>>>>>>
> >>>>>>>>>> To explain further about my regex pattern, \s* is instructing the regex
> >>>>>>>>>> to match any \n that has spaces after it, and {2,} is instructing the
> >>>>>>>>>> regex to match 2 or more occurrences of such a pattern (\n).
> >>>>>>>>>>
> >>>>>>>>>> Please kindly let me know what is wrong and how should I do it?
> >>>>>>>>>>
> >>>>>>>>>> I am using Solr 7.6.0.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Edwin
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Jörn Franke <jo...@gmail.com>.
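Then you need two RegexReplaceProcessorFactory steps: first one that replaces
runs of more than one \n with <br><br>, then one that replaces each remaining
single \n with <br>.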
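
A minimal sketch of the two-step idea in plain Java, assuming the field really
contains newline characters and that the two steps run in this order. The class
name, method name and both patterns are hypothetical illustrations and are not
taken from this thread. Runs of two or more line breaks are collapsed first, so
the second step only ever sees isolated line breaks.

```java
import java.util.regex.Pattern;

public class TwoStepBreaks {
    // Step 1 (assumed pattern): two or more line breaks, with optional
    // spaces/tabs around them, become exactly two <br>.
    public static final Pattern MULTI =
            Pattern.compile("[ \\t]*(?:\\r?\\n[ \\t]*){2,}");
    // Step 2 (assumed pattern): any line break still left is a single one.
    public static final Pattern SINGLE =
            Pattern.compile("[ \\t]*\\r?\\n[ \\t]*");

    public static String toBreaks(String content) {
        // Order matters: collapse the runs before touching single breaks.
        String collapsed = MULTI.matcher(content).replaceAll("<br><br>");
        return SINGLE.matcher(collapsed).replaceAll("<br>");
    }

    public static void main(String[] args) {
        String sample = "Dear Sir,  \n\n \n \n\n I am terminating\nRegards";
        System.out.println(toBreaks(sample));
    }
}
```

In solrconfig.xml this would presumably correspond to two
solr.RegexReplaceProcessorFactory entries in the same chain, with the
multi-newline one listed first.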

> On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:
> 
> Hi,
> 
> Thanks for the reply.
> 
> Do you know of any regex online tool that works correctly for Java regex?
> I tried to find some, but they are not working properly.
> 
> Yes, our plan is to replace more than one \n with <br><br>, and single \n
> with single <br>.
> 
> Regards,
> Edwin
> 

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

I am using Java 1.8.0_201.

Regards,
Edwin
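
Since Solr passes the pattern to java.util.regex, one way to check it without
an online tool is a small program run on the same JDK. The sketch below is only
an illustration: the class name, the sample string and the second pattern are
hypothetical, and the non-breaking space (U+00A0) in the sample is an assumption
about what such data might contain, not something confirmed in this thread. It
relies on the documented behaviour that Java's default \s covers only
[ \t\n\x0B\f\r], so a character such as U+00A0 sitting between two groups of \n
would split them into two separate matches.

```java
import java.util.regex.Pattern;

public class RegexCheck {
    // Pattern from the thread: two or more \n, each followed by optional whitespace.
    public static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");
    // Hypothetical variant that also accepts U+00A0 between the line breaks.
    public static final Pattern WITH_NBSP = Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    public static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Assumed sample: a non-breaking space between two pairs of line breaks.
        String sample = "Psalm 89:17\n\n\u00A0\n\n3 Choa Chu Kang";
        System.out.println(collapse(ASCII_WS, sample));   // two separate replacements
        System.out.println(collapse(WITH_NBSP, sample));  // one replacement
    }
}
```

If the first line printed shows two pairs of <br> and the second shows one, the
input contains something between the line breaks that \s does not cover.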

On Wed, 20 Feb 2019 at 16:01, <pa...@ub.unibe.ch> wrote:

> BTW, which Java version are you using?
>
>
>
>
>
>
> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> Sent: Wednesday, 20 February 2019 08:13
> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
>
>
> Hi,
>
> Thanks for the reply.
>
> Do you know of any regex online tool that works correctly for Java regex?
> I tried to find some, but they are not working properly.
>
> Yes, our plan is to replace more than one \n with <br><br>, and single \n
> with single <br>.
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com> wrote:
>
> > Solr uses Java regex matching, so I doubt there is a bug - it would then
> > be in the JDK. Try out your solution in an online regex tool that supports
> > Java regex.
> >
> > I believe you want to have 2 regex process factories:
> > One that deals with single \n and one that deals with more than one \n
> >
> > > On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > > configuration:
> > >
> > > <processor class="solr.RegexReplaceProcessorFactory">
> > >   <str name="fieldName">content</str>
> > >   <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >   <bool name="literalReplacement">true</bool>
> > > </processor>
> > >
> > > However, the issue is still occurring.
> > >
> > > Is anyone else able to help?
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>
> > >> Regards,
> > >> Edwin
> > >>
> > >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Should we report this as a bug in Solr?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Paul,
> > >>>>
> > >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> > >>>> https://regex101.com/, it gives us the correct result for all the
> > >>>> examples (i.e. all of them end up with only <br><br>, and not more than
> > >>>> that, unlike what we are getting in Solr in our earlier examples).
> > >>>>
> > >>>> Could there be a possibility of a bug in Solr?
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Paul,
> > >>>>>
> > >>>>> We have tried it with the space preceding the \n, i.e. <str
> > >>>>> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
> > >>>>>
> > >>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>   <str name="fieldName">content</str>
> > >>>>>   <str name="pattern">(\s*\n){2,}</str>
> > >>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>> </processor>
> > >>>>>
> > >>>>> However, we are still getting exactly the same results as in the earlier
> > >>>>> Examples 1, 2 and 3.
> > >>>>>
> > >>>>> As for your point 2, that the data may contain other (non-printing)
> > >>>>> characters than \n, we have found that there are no non-printing
> > >>>>> characters. It is just a line break with a space. You can refer to the
> > >>>>> original content in the same examples below.
> > >>>>>
> > >>>>>
> > >>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>> correctly
> > >>>>> *Original content in EML file:*
> > >>>>> Dear Sir,
> > >>>>>
> > >>>>>
> > >>>>> I am terminating
> > >>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>
> > >>>>> Example 2: The sentence that the above regex pattern is partially
> > >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>> *Original content in EML file:*
> > >>>>>
> > >>>>> *exalted*
> > >>>>>
> > >>>>> *Psalm 89:17*
> > >>>>>
> > >>>>>
> > >>>>> 3 Choa Chu Kang Avenue 4
> > >>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> > >>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> > >>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>
> > >>>>> Example 3: The sentence that the above regex pattern is partially
> > >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>> *Original content in EML file:*
> > >>>>>
> > >>>>> http://www.concordpri.moe.edu.sg/
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> > >>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> > >>>>> Dec 18, 2018 at 10:07 AM
> > >>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>
> > >>>>>
> > >>>>> Appreciate any other ideas or suggestions that you may have.
> > >>>>>
> > >>>>> Thank you.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Edwin
> > >>>>>
> > >>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
> > >>>>>>
> > >>>>>> Hi Edwin
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>  1.  Sorry, the pattern was wrong, the space should precede the \n
> > >>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>  2.  Perhaps in the data you have other (non printing) characters
> > >>>>>> than \n?
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>>>>> Sent: Thursday, 7 February 2019 15:23
> > >>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Hi Paul,
> > >>>>>>
> > >>>>>> We have tried this suggested regex pattern as follows:
> > >>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>   <str name="fieldName">content</str>
> > >>>>>>   <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>> </processor>
> > >>>>>>
> > >>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
> > >>>>>> below.
> > >>>>>>
> > >>>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>>> correctly
> > >>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>
> > >>>>>> Example 2: The sentence that the above regex pattern is partially
> > >>>>>> working
> > >>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> > >>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> > >>>>>> Choa Chu Kang Avenue 4, Singapore
> > >>>>>>
> > >>>>>> Example 3: The sentence that the above regex pattern is partially
> > >>>>>> working
> > >>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
> > >>>>>> \n \n\n
> > >>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> > >>>>>> 2018 at 10:07 AM
> > >>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>
> > >>>>>> Any further suggestion?
> > >>>>>>
> > >>>>>> Thank you.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Edwin
> > >>>>>>
> > >>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
> > >>>>>>>
> > >>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on
> the
> > >>>>>> {2,}
> > >>>>>>> part you could try
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> If you also want to match CRLF then
> > >>>>>>>
> > >>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
> > >>>>>>> for Windows 10
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>>>>>> Sent: Thursday, 7 February 2019 15:10
> > >>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
> > >>>>>>> multiple \n
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Hi Paul,
> > >>>>>>>
> > >>>>>>> Thanks for your reply.
> > >>>>>>>
> > >>>>>>> When I use this pattern:
> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> > >>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>> </processor>
> > >>>>>>>
> > >>>>>>> It is working for some sentence within the same content and not
> > >>>>>> working for
> > >>>>>>> some sentences. Please see below for the one that is working and
> > >>>>>> another
> > >>>>>>> that is not working (partially working):
> > >>>>>>>
> > >>>>>>> Example 1: The sentence that the above regex pattern is working
> > >>>>>> correctly
> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> > >>>>>>>
> > >>>>>>> Example 2: The sentence that the above regex pattern is partially
> > >>>>>> working
> > >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n
>  \n\n  3
> > >>>>>> Choa
> > >>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> <br><br>3
> > >>>>>> Choa
> > >>>>>>> Chu Kang Avenue 4, Singapore
> > >>>>>>>
> > >>>>>>> Example 3: The sentence that the above regex pattern is partially
> > >>>>>> working
> > >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
> >  \n\n
> > >>>>>> \n
> > >>>>>>> \n\n
> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> Dec
> > >>>>>> 18, 2018
> > >>>>>>> at 10:07 AM
> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> > >>>>>> <br><br>On
> > >>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>
> > >>>>>>> We would appreciate your help to see what is wrong?
> > >>>>>>>
> > >>>>>>> Thank you.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Edwin
> > >>>>>>>
> > >>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
> > >>>>>>>>
> > >>>>>>>> You don’t say what happens, just that it is not working. I
> assume
> > >>>>>> nothing
> > >>>>>>>> is replaced? Perhaps the pattern should be
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>   <str name="pattern">"(\n\s*){2,}"</str>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> ??
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
> > >>>>>>>> for Windows 10
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> > >>>>>>>> Sent: Thursday, 7 February 2019 14:08
> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple
> > >>>>>>>> \n
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove
> more
> > >>>>>> than
> > >>>>>>> two
> > >>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n, \n
> \n
> > >>>>>> \n
> > >>>>>>> \n),
> > >>>>>>>> and replace it with two <br>.
> > >>>>>>>>
> > >>>>>>>> I use the following regex pattern and it is working when I test
> it
> > >>>>>> in
> > >>>>>>>> regex101.com. But it is not working when I put it inside the
> > >>>>>>>> RegexReplaceProcessorFactory as below:
> > >>>>>>>>
> > >>>>>>>> <updateRequestProcessorChain name="removeCode">
> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
> > >>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> > >>>>>>>> </processor>
> > >>>>>>>>          </updateRequestProcessorChain>
> > >>>>>>>>
> > >>>>>>>> To explain further about my regex pattern, \s* is instructing
> the
> > >>>>>> regex
> > >>>>>>> to
> > >>>>>>>> match any \n that have space after and {2,} is instructing the
> > >>>>>> regex to
> > >>>>>>>> match 2 or more occurrence of such pattern (\n).
> > >>>>>>>>
> > >>>>>>>> Please kindly let me know what is wrong and how should I do it?
> > >>>>>>>>
> > >>>>>>>> I am using Solr 7.6.0.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Edwin
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> >
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
BTW, which Java version are you using?



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Wednesday, 20 February 2019 08:13
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n



Hi,

Thanks for the reply.

Do you know of any online regex tool that works correctly with Java regex?
I tried to find some, but they do not work properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n
with a single <br>.

Regards,
Edwin

On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com> wrote:

> Solr uses Java regex matching, so I doubt there is a bug in Solr; it would
> then be in the JDK. Try your pattern in an online regex tool that supports
> Java regex.
>
> I believe you want two regex processor factories:
> one that deals with a single \n and one that deals with more than one \n.
>
> > Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >:
> >
> > Hi,
> >
> > We have tried with the following pattern ([ \t]*\r?\n){2,} and
> > configuration:
> >
> > <processor class="solr.RegexReplaceProcessorFactory">
> >   <str name="fieldName">content</str>
> >   <str name="pattern">([ \t]*\r?\n){2,}</str>
> >   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >   <bool name="literalReplacement">true</bool>
> > </processor>
> >
> > However, the issue is still occurring.
> >
> > Is anyone else able to help?
> >
> > Regards,
> > Edwin
> >
> > On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> For your info, this issue is occurring in Solr 7.7.0 as well.
> >>
> >> Regards,
> >> Edwin
> >>
> >> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Should we report this as a bug in Solr?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >
> >>> wrote:
> >>>
> >>>> Hi Paul,
> >>>>
> >>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
> >>>> https://regex101.com/, it gives us the correct result for all the
> >>>> examples (i.e. all of them have only <br><br>, and not more than that,
> >>>> unlike what we are getting in Solr in our earlier examples).
> >>>>
> >>>> Could there be a possibility of a bug in Solr?
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <
> edwinyeozl@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Paul,
> >>>>>
> >>>>> We have tried it with the space preceding the \n, i.e.
> >>>>> <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> >>>>>
> >>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>   <str name="fieldName">content</str>
> >>>>>   <str name="pattern">(\s*\n){2,}</str>
> >>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>> </processor>
> >>>>>
> >>>>> However, we are getting exactly the same results as in the earlier
> >>>>> Examples 1, 2 and 3.
> >>>>>
> >>>>> As for your point 2, that the data may contain other (non-printing)
> >>>>> characters besides \n, we have found no non-printing characters. It is
> >>>>> just a line break followed by a space. You can refer to the original
> >>>>> content in the same examples below.
> >>>>>
> >>>>>
> >>>>> Example 1: The sentence that the above regex pattern is working
> >>>>> correctly
> >>>>> *Original content in EML file:*
> >>>>> Dear Sir,
> >>>>>
> >>>>>
> >>>>> I am terminating
> >>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>
> >>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> *exalted*
> >>>>>
> >>>>> *Psalm 89:17*
> >>>>>
> >>>>>
> >>>>> 3 Choa Chu Kang Avenue 4
> >>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>> Choa Chu Kang Avenue 4, Singapore
> >>>>>
> >>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>> *Original content in EML file:*
> >>>>>
> >>>>> http://www.concordpri.moe.edu.sg/
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Dec 18, 2018 at 10:07 AM
> >>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
> \n
> >>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue,
> Dec 18,
> >>>>> 2018 at 10:07 AM
> >>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> >>>>>
> >>>>>
> >>>>> Appreciate any other ideas or suggestions that you may have.
> >>>>>
> >>>>> Thank you.
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
> >>>>>>
> >>>>>> Hi Edwin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>  1.  Sorry, the pattern was wrong; the space should precede the \n,
> >>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
> >>>>>>  2.  Perhaps in the data you have other (non printing) characters
> >>>>>> than \n?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
> >>>>>> for Windows 10
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
> >>>>>> multiple \n
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> We have tried the suggested regex pattern as follows:
> >>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>   <str name="fieldName">content</str>
> >>>>>>   <str name="pattern">(\n\s*){2,}</str>
> >>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>> </processor>
> >>>>>>
> >>>>>> But we still see exactly the same problem in Examples 1, 2 and 3
> >>>>>> below.
> >>>>>>
> >>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>> correctly
> >>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>
> >>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>> working
> >>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>> Choa
> >>>>>> Chu Kang Avenue 4, Singapore
> >>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>> Choa
> >>>>>> Chu Kang Avenue 4, Singapore
> >>>>>>
> >>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>> working
> >>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
> >>>>>> \n \n\n
> >>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
> 18,
> >>>>>> 2018
> >>>>>> at 10:07 AM
> >>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>> <br><br>On
> >>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>
> >>>>>> Any further suggestion?
> >>>>>>
> >>>>>> Thank you.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Edwin
> >>>>>>
> >>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
> >>>>>>>
> >>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
> >>>>>> {2,}
> >>>>>>> part you could try
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> <str name="pattern">(\n\s*){2,}</str>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If you also want to match CRLF then
> >>>>>>>
> >>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
> >>>>>>> for Windows 10
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect
> >>>>>>> multiple \n
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Paul,
> >>>>>>>
> >>>>>>> Thanks for your reply.
> >>>>>>>
> >>>>>>> When I use this pattern:
> >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>   <str name="fieldName">content</str>
> >>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> >>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>> </processor>
> >>>>>>>
> >>>>>>> It is working for some sentence within the same content and not
> >>>>>> working for
> >>>>>>> some sentences. Please see below for the one that is working and
> >>>>>> another
> >>>>>>> that is not working (partially working):
> >>>>>>>
> >>>>>>> Example 1: The sentence that the above regex pattern is working
> >>>>>> correctly
> >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
> >>>>>>>
> >>>>>>> Example 2: The sentence that the above regex pattern is partially
> >>>>>> working
> >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
> >>>>>> Choa
> >>>>>>> Chu Kang Avenue 4, Singapore
> >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
> >>>>>> Choa
> >>>>>>> Chu Kang Avenue 4, Singapore
> >>>>>>>
> >>>>>>> Example 3: The sentence that the above regex pattern is partially
> >>>>>> working
> >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
> >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n
>  \n\n
> >>>>>> \n
> >>>>>>> \n\n
> >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
> >>>>>> 18, 2018
> >>>>>>> at 10:07 AM
> >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
> >>>>>> <br><br>On
> >>>>>>> Tue, Dec 18, 2018 at 10:07 AM
> >>>>>>>
> >>>>>>> We would appreciate your help to see what is wrong?
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>>
> >>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
> >>>>>>>>
> >>>>>>>> You don’t say what happens, just that it is not working. I assume
> >>>>>> nothing
> >>>>>>>> is replaced? Perhaps the pattern should be
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>   <str name="pattern">"(\n\s*){2,}"</str>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ??
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986>
> >>>>>>>> for Windows 10
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple
> >>>>>>>> \n
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove more
> >>>>>> than
> >>>>>>> two
> >>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n
> >>>>>> \n
> >>>>>>> \n),
> >>>>>>>> and replace it with two <br>.
> >>>>>>>>
> >>>>>>>> I use the following regex pattern and it is working when I test it
> >>>>>> in
> >>>>>>>> regex101.com. But it is not working when I put it inside the
> >>>>>>>> RegexReplaceProcessorFactory as below:
> >>>>>>>>
> >>>>>>>> <updateRequestProcessorChain name="removeCode">
> >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> >>>>>>>>   <str name="fieldName">content</str>
> >>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
> >>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> >>>>>>>> </processor>
> >>>>>>>>          </updateRequestProcessorChain>
> >>>>>>>>
> >>>>>>>> To explain further about my regex pattern, \s* is instructing the
> >>>>>> regex
> >>>>>>> to
> >>>>>>>> match any \n that have space after and {2,} is instructing the
> >>>>>> regex to
> >>>>>>>> match 2 or more occurrence of such pattern (\n).
> >>>>>>>>
> >>>>>>>> Please kindly let me know what is wrong and how should I do it?
> >>>>>>>>
> >>>>>>>> I am using Solr 7.6.0.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Edwin
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

Thanks for the reply.

Do you know of any online regex tool that works correctly with Java regex?
I tried to find some, but they do not work properly.

Yes, our plan is to replace more than one \n with <br><br>, and a single \n
with a single <br>.

Regards,
Edwin
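
One way to check a pattern against Java's own regex engine, without relying on an online tool, is a few lines of plain Java. The sketch below is only an illustration: the class name RegexCheck and the sample strings are made up, and it assumes the processor applies the pattern the way Matcher.replaceAll does. It also shows one hypothetical cause for the split matches: by default Java's \s does not match a non-breaking space (U+00A0), so such a character sitting between two newline runs would yield two separate replacements. Whether the indexed content really contains U+00A0 has not been confirmed in this thread.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // Pattern from the thread as a Java string literal (backslashes doubled).
    static final Pattern ASCII_WS = Pattern.compile("(\\n\\s*){2,}");
    // Same pattern, but \s also matches Unicode white space such as U+00A0.
    static final Pattern UNICODE_WS =
        Pattern.compile("(\\n\\s*){2,}", Pattern.UNICODE_CHARACTER_CLASS);

    static String collapse(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, using real newline and space characters.
        String plain = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa";
        System.out.println(collapse(ASCII_WS, plain));

        // Hypothetical input: non-breaking spaces between two newline runs.
        String nbsp = "Psalm 89:17   \n\n\u00A0\u00A0\n\n  3 Choa";
        System.out.println(collapse(ASCII_WS, nbsp));   // two <br><br> groups
        System.out.println(collapse(UNICODE_WS, nbsp)); // one <br><br> group
    }
}
```

If the second case matches what is seen in the index, the same flag can be embedded in the pattern itself as (?U), for example (?U)(\n\s*){2,}; this is a suggestion to try, not a confirmed fix.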

On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jo...@gmail.com> wrote:

> Solr uses Java regex matching, so I doubt there is a bug in Solr; it would
> then be in the JDK. Try your pattern in an online regex tool that supports
> Java regex.
>
> I believe you want two regex processor factories:
> one that deals with a single \n and one that deals with more than one \n.

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Jörn Franke <jo...@gmail.com>.
Solr uses Java regex matching, so I doubt there is a bug in Solr; it would then be in the JDK. Try your pattern in an online regex tool that supports Java regex.

I believe you want two regex processor factories:
one that deals with a single \n and one that deals with more than one \n.
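
The two-step idea can be tried in plain Java before touching the Solr configuration. This is a minimal sketch under assumptions: the class name TwoPassBreaks and the sample text are invented for illustration, and the two patterns stand in for two RegexReplaceProcessorFactory entries placed in this order in the update chain. The order matters: runs of newlines have to be collapsed first, otherwise the single-newline pass would already have replaced every \n.

```java
import java.util.regex.Pattern;

class TwoPassBreaks {
    // Pass 1: two or more newlines, optionally separated by whitespace.
    static final Pattern MULTI = Pattern.compile("(\\n\\s*){2,}");
    // Pass 2: any single newline that survived pass 1.
    static final Pattern SINGLE = Pattern.compile("\\n");

    static String toBreaks(String text) {
        String collapsed = MULTI.matcher(text).replaceAll("<br><br>");
        return SINGLE.matcher(collapsed).replaceAll("<br>");
    }

    public static void main(String[] args) {
        String text = "Dear Sir,\n\n \n I am terminating\nthe contract";
        System.out.println(toBreaks(text));
    }
}
```

In solrconfig.xml this would presumably become two processor elements with the patterns (\n\s*){2,} and \n, in that order.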

> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo <ed...@gmail.com>:
> 
> Hi,
> 
> We have tried with the following pattern ([ \t]*\r?\n){2,} and
> configuration:
> 
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> 
> However, the issue is still occurring.
> 
> Is anyone else able to help?
> 
> Regards,
> Edwin
> 
> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
> 
>> Hi,
>> 
>> For your info, this issue is occurring in Solr 7.7.0 as well.
>> 
>> Regards,
>> Edwin
>> 
>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> Should we report this as a bug in Solr?
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <ed...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Paul,
>>>> 
>>>> Regarding the regex (\n\s*){2,} that we are using, when we try in on
>>>> https://regex101.com/, it is able to give us the correct result for all
>>>> the examples (ie: All of them will only have <br><br>, and not more than
>>>> that like what we are getting in Solr in our earlier examples).
>>>> 
>>>> Could there be a possibility of a bug in Solr?
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <ed...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Paul,
>>>>> 
>>>>> We have tried it with the space preceeding the \n i.e. <str
>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
>>>>> 
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>   <str name="fieldName">content</str>
>>>>>   <str name="pattern">(\s*\n){2,}</str>
>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> </processor>
>>>>> 
>>>>> However, we are also getting the exact same results as the earlier
>>>>> Example 1, 2 and 3.
>>>>> 
>>>>> As for your point 2 on perhaps in the data you have other (non
>>>>> printing) characters than \n, we have find that there are no non printing
>>>>> characters. It is just next line with a space. You can refer to the
>>>>> original content in the same examples below.
>>>>> 
>>>>> 
>>>>> Example 1: The sentence that the above regex pattern is working
>>>>> correctly
>>>>> *Original content in EML file:*
>>>>> Dear Sir,
>>>>> 
>>>>> 
>>>>> I am terminating
>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>> 
>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>> 
>>>>> *exalted*
>>>>> 
>>>>> *Psalm 89:17*
>>>>> 
>>>>> 
>>>>> 3 Choa Chu Kang Avenue 4
>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>> Choa Chu Kang Avenue 4, Singapore
>>>>> 
>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content in EML file:*
>>>>> 
>>>>> http://www.concordpri.moe.edu.sg/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>>> 2018 at 10:07 AM
>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>> 
>>>>> 
>>>>> Appreciate any other ideas or suggestions that you may have.
>>>>> 
>>>>> Thank you.
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>>> 
>>>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>>>>>> 
>>>>>> Hi Edwin
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  1.  Sorry, the pattern was wrong, the space should preceed the \n
>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>>  2.  Perhaps in the data you have other (non printing) characters
>>>>>> than \n?
>>>>>> 
>>>>>> 
>>>>>> 
> >>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>> Windows 10
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>> Sent: Thursday, 7 February 2019 15:23
> >>>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi Paul,
>>>>>> 
>>>>>> We have tried this suggested regex pattern as follow:
>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>   <str name="fieldName">content</str>
>>>>>>   <str name="pattern">(\n\s*){2,}</str>
>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>> </processor>
>>>>>> 
>>>>>> But we still have exactly the same problem of Example 1,2 and 3 below.
>>>>>> 
>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>> correctly
>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>> 
>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>> working
>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>> Choa
>>>>>> Chu Kang Avenue 4, Singapore
>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>>> Choa
>>>>>> Chu Kang Avenue 4, Singapore
>>>>>> 
>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>> working
>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>>> \n \n\n
>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>>>> 2018
>>>>>> at 10:07 AM
>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>> <br><br>On
>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>> 
>>>>>> Any further suggestion?
>>>>>> 
>>>>>> Thank you.
>>>>>> 
>>>>>> Regards,
>>>>>> Edwin
>>>>>> 
>>>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>>>>>>> 
>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>>>>>> {2,}
>>>>>>> part you could try
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> If you also want to match CRLF then
>>>>>>> 
>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
> >>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>> Windows 10
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>> Sent: Thursday, 7 February 2019 15:10
> >>>>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
> >>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Paul,
>>>>>>> 
>>>>>>> Thanks for your reply.
>>>>>>> 
>>>>>>> When I use this pattern:
>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>   <str name="fieldName">content</str>
>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>> </processor>
>>>>>>> 
>>>>>>> It is working for some sentence within the same content and not
>>>>>> working for
>>>>>>> some sentences. Please see below for the one that is working and
>>>>>> another
>>>>>>> that is not working (partially working):
>>>>>>> 
>>>>>>> Example 1: The sentence that the above regex pattern is working
>>>>>> correctly
>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>>> 
>>>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>>> working
>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>>> Choa
>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>>> Choa
>>>>>>> Chu Kang Avenue 4, Singapore
>>>>>>> 
>>>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>>> working
>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>>> \n
>>>>>>> \n\n
>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
>>>>>> 18, 2018
>>>>>>> at 10:07 AM
>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>>> <br><br>On
>>>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>>> 
>>>>>>> We would appreciate your help to see what is wrong?
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>>> 
>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>>>>>>>> 
>>>>>>>> You don’t say what happens, just that it is not working. I assume
>>>>>> nothing
>>>>>>>> is replaced? Perhaps the pattern should be
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   <str name="pattern">"(\n\s*){2,}"</str>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ??
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
> >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>>>> Windows 10
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
> >>>>>>>> Sent: Thursday, 7 February 2019 14:08
> >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> >>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to remove more
>>>>>> than
>>>>>>> two
>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n
>>>>>> \n
>>>>>>> \n),
>>>>>>>> and replace it with two <br>.
>>>>>>>> 
>>>>>>>> I use the following regex pattern and it is working when I test it
>>>>>> in
>>>>>>>> regex101.com. But it is not working when I put it inside the
>>>>>>>> RegexReplaceProcessorFactory as below:
>>>>>>>> 
>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>>>>   <str name="fieldName">content</str>
>>>>>>>>   <str name="pattern">"(\\n\s*){2,}"</str>
>>>>>>>>   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>>>>> </processor>
>>>>>>>>          </updateRequestProcessorChain>
>>>>>>>> 
>>>>>>>> To explain further about my regex pattern, \s* is instructing the
>>>>>> regex
>>>>>>> to
>>>>>>>> match any \n that have space after and {2,} is instructing the
>>>>>> regex to
>>>>>>>> match 2 or more occurrence of such pattern (\n).
>>>>>>>> 
>>>>>>>> Please kindly let me know what is wrong and how should I do it?
>>>>>>>> 
>>>>>>>> I am using Solr 7.6.0.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Edwin
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

We have tried with the following pattern ([ \t]*\r?\n){2,} and
configuration:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">([ \t]*\r?\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
</processor>

However, the issue is still occurring.
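
To help narrow this down outside Solr, here is a small Java sketch that applies the same pattern and also lists any characters that look blank but are not an ordinary space, tab, CR or LF. The class name, method names and test strings are invented for illustration, and the non-breaking space (U+00A0) in the second string is only a guess at what such a character could be; nothing in this thread confirms that the real data contains one.

```java
import java.util.regex.Pattern;

final class WhitespaceProbe {
    // Same pattern as in the <str name="pattern"> element above
    static final Pattern BLANK_LINES = Pattern.compile("([ \\t]*\\r?\\n){2,}");

    static String replace(String text) {
        return BLANK_LINES.matcher(text).replaceAll("<br><br>");
    }

    // Reports code points that are whitespace-like or control characters
    // but are not a plain space, tab, CR or LF.
    static String oddWhitespace(String text) {
        StringBuilder found = new StringBuilder();
        text.codePoints().forEach(cp -> {
            boolean plain = cp == ' ' || cp == '\t' || cp == '\r' || cp == '\n';
            boolean blankLike = Character.isWhitespace(cp)
                    || Character.isSpaceChar(cp)
                    || Character.isISOControl(cp);
            if (blankLike && !plain) {
                found.append(String.format("U+%04X ", cp));
            }
        });
        return found.toString().trim();
    }

    public static void main(String[] args) {
        String ascii = "x \n\n \n\n y";        // only ordinary spaces (assumed input)
        String nbsp = "x \n\n\u00A0\n\n y";    // hypothetical: NBSP between the blank lines
        System.out.println(replace(ascii));
        System.out.println(replace(nbsp).replace("\u00A0", "[NBSP]"));
        System.out.println(oddWhitespace(nbsp));
    }
}
```

With only ordinary spaces the whole run collapses into one <br><br>. With a character that [ \t] does not cover sitting between the blank lines, the run is split in two and that character is left in the middle, which has the same shape as the "<br><br>  <br><br>" seen in Example 2. Running oddWhitespace over the raw field value (before it reaches the update chain) would show whether that is what is happening here.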

Is anyone else able to help?

Regards,
Edwin

On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> For your info, this issue is occurring in Solr 7.7.0 as well.
>
> Regards,
> Edwin
>
> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Should we report this as a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> Regarding the regex (\n\s*){2,} that we are using, when we try in on
>>> https://regex101.com/, it is able to give us the correct result for all
>>> the examples (ie: All of them will only have <br><br>, and not more than
>>> that like what we are getting in Solr in our earlier examples).
>>>
>>> Could there be a possibility of a bug in Solr?
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <ed...@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> We have tried it with the space preceeding the \n i.e. <str
>>>> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
>>>>
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">(\s*\n){2,}</str>
>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> </processor>
>>>>
>>>> However, we are also getting the exact same results as the earlier
>>>> Example 1, 2 and 3.
>>>>
>>>> As for your point 2 on perhaps in the data you have other (non
>>>> printing) characters than \n, we have find that there are no non printing
>>>> characters. It is just next line with a space. You can refer to the
>>>> original content in the same examples below.
>>>>
>>>>
>>>> Example 1: The sentence that the above regex pattern is working
>>>> correctly
>>>> *Original content in EML file:*
>>>> Dear Sir,
>>>>
>>>>
>>>> I am terminating
>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>
>>>> Example 2: The sentence that the above regex pattern is partially
>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> *exalted*
>>>>
>>>> *Psalm 89:17*
>>>>
>>>>
>>>> 3 Choa Chu Kang Avenue 4
>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>> Choa Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: The sentence that the above regex pattern is partially
>>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content in EML file:*
>>>>
>>>> http://www.concordpri.moe.edu.sg/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> 2018 at 10:07 AM
>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>>
>>>> Appreciate any other ideas or suggestions that you may have.
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>>>>
>>>>> Hi Edwin
>>>>>
>>>>>
>>>>>
>>>>>   1.  Sorry, the pattern was wrong, the space should preceed the \n
>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>>   2.  Perhaps in the data you have other (non printing) characters
>>>>> than \n?
>>>>>
>>>>>
>>>>>
>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> Windows 10
>>>>>
>>>>>
>>>>>
>>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>>> Sent: Thursday, 7 February 2019 15:23
>>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>>
>>>>>
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> We have tried this suggested regex pattern as follow:
>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>>    <str name="fieldName">content</str>
>>>>>    <str name="pattern">(\n\s*){2,}</str>
>>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> </processor>
>>>>>
>>>>> But we still have exactly the same problem of Example 1,2 and 3 below.
>>>>>
>>>>> Example 1: The sentence that the above regex pattern is working
>>>>> correctly
>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>>
>>>>> Example 2: The sentence that the above regex pattern is partially
>>>>> working
>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> Choa
>>>>> Chu Kang Avenue 4, Singapore
>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>> Choa
>>>>> Chu Kang Avenue 4, Singapore
>>>>>
>>>>> Example 3: The sentence that the above regex pattern is partially
>>>>> working
>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>> \n \n\n
>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>>> 2018
>>>>> at 10:07 AM
>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>> <br><br>On
>>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>>
>>>>> Any further suggestion?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Edwin
>>>>>
>>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>>>>>
>>>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>>>> {2,}
>>>>> > part you could try
>>>>> >
>>>>> >
>>>>> >
>>>>> > <str name="pattern">(\n\s*){2,}</str>
>>>>> >
>>>>> >
>>>>> >
>>>>> > If you also want to match CRLF then
>>>>> >
>>>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> > Windows 10
>>>>> >
>>>>> >
>>>>> >
>>>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>>> > Sent: Thursday, 7 February 2019 15:10
>>>>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>> >
>>>>> >
>>>>> >
>>>>> > Hi Paul,
>>>>> >
>>>>> > Thanks for your reply.
>>>>> >
>>>>> > When I use this pattern:
>>>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>>>> >    <str name="fieldName">content</str>
>>>>> >    <str name="pattern">(\n+\s*){2,}</str>
>>>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> > </processor>
>>>>> >
>>>>> > It is working for some sentence within the same content and not
>>>>> working for
>>>>> > some sentences. Please see below for the one that is working and
>>>>> another
>>>>> > that is not working (partially working):
>>>>> >
>>>>> > Example 1: The sentence that the above regex pattern is working
>>>>> correctly
>>>>> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>>> > *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>> >
>>>>> > Example 2: The sentence that the above regex pattern is partially
>>>>> working
>>>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>>> Choa
>>>>> > Chu Kang Avenue 4, Singapore
>>>>> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>>> Choa
>>>>> > Chu Kang Avenue 4, Singapore
>>>>> >
>>>>> > Example 3: The sentence that the above regex pattern is partially
>>>>> working
>>>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>>> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>>> \n
>>>>> > \n\n
>>>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
>>>>> 18, 2018
>>>>> > at 10:07 AM
>>>>> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>>> <br><br>On
>>>>> > Tue, Dec 18, 2018 at 10:07 AM
>>>>> >
>>>>> > We would appreciate your help to see what is wrong?
>>>>> >
>>>>> > Thank you.
>>>>> >
>>>>> > Regards,
>>>>> > Edwin
>>>>> >
>>>>> > On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>>>>> >
>>>>> > > You don’t say what happens, just that it is not working. I assume
>>>>> nothing
>>>>> > > is replaced? Perhaps the pattern should be
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >    <str name="pattern">"(\n\s*){2,}"</str>
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > ??
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> > > Windows 10
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>>> > > Sent: Thursday, 7 February 2019 14:08
>>>>> > > To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > Hi,
>>>>> > >
>>>>> > > I am trying to use the RegexReplaceProcessorFactory to remove more
>>>>> than
>>>>> > two
>>>>> > > \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n
>>>>> \n
>>>>> > \n),
>>>>> > > and replace it with two <br>.
>>>>> > >
>>>>> > > I use the following regex pattern and it is working when I test it
>>>>> in
>>>>> > > regex101.com. But it is not working when I put it inside the
>>>>> > > RegexReplaceProcessorFactory as below:
>>>>> > >
>>>>> > > <updateRequestProcessorChain name="removeCode">
>>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>>> > >    <str name="fieldName">content</str>
>>>>> > >    <str name="pattern">"(\\n\s*){2,}"</str>
>>>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>>> > > </processor>
>>>>> > >           </updateRequestProcessorChain>
>>>>> > >
>>>>> > > To explain further about my regex pattern, \s* is instructing the
>>>>> regex
>>>>> > to
>>>>> > > match any \n that have space after and {2,} is instructing the
>>>>> regex to
>>>>> > > match 2 or more occurrence of such pattern (\n).
>>>>> > >
>>>>> > > Please kindly let me know what is wrong and how should I do it?
>>>>> > >
>>>>> > > I am using Solr 7.6.0.
>>>>> > >
>>>>> > > Regards,
>>>>> > > Edwin
>>>>> > >
>>>>> >
>>>>>
>>>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

For your info, this issue is occurring in Solr 7.7.0 as well.

Regards,
Edwin

On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> Should we report this as a bug in Solr?
>
> Regards,
> Edwin
>
> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Regarding the regex (\n\s*){2,} that we are using, when we try in on
>> https://regex101.com/, it is able to give us the correct result for all
>> the examples (ie: All of them will only have <br><br>, and not more than
>> that like what we are getting in Solr in our earlier examples).
>>
>> Could there be a possibility of a bug in Solr?
>>
>> Regards,
>> Edwin
>>
>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <ed...@gmail.com>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> We have tried it with the space preceeding the \n i.e. <str
>>> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
>>>
>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>    <str name="fieldName">content</str>
>>>    <str name="pattern">(\s*\n){2,}</str>
>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> </processor>
>>>
>>> However, we are also getting the exact same results as the earlier
>>> Example 1, 2 and 3.
>>>
>>> As for your point 2 on perhaps in the data you have other (non printing)
>>> characters than \n, we have find that there are no non printing characters.
>>> It is just next line with a space. You can refer to the original content in
>>> the same examples below.
>>>
>>>
>>> Example 1: The sentence that the above regex pattern is working
>>> correctly
>>> *Original content in EML file:*
>>> Dear Sir,
>>>
>>>
>>> I am terminating
>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>
>>> Example 2: The sentence that the above regex pattern is partially
>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content in EML file:*
>>>
>>> *exalted*
>>>
>>> *Psalm 89:17*
>>>
>>>
>>> 3 Choa Chu Kang Avenue 4
>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>> Choa Chu Kang Avenue 4, Singapore
>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>> Choa Chu Kang Avenue 4, Singapore
>>>
>>> Example 3: The sentence that the above regex pattern is partially
>>> working (as you can see, instead of 2 <br>, there are 4 <br>)
>>> *Original content in EML file:*
>>>
>>> http://www.concordpri.moe.edu.sg/
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>> 2018 at 10:07 AM
>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>>
>>>
>>> Appreciate any other ideas or suggestions that you may have.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Thu, 7 Feb 2019 at 22:49, <pa...@ub.unibe.ch> wrote:
>>>
>>>> Hi Edwin
>>>>
>>>>
>>>>
>>>>   1.  Sorry, the pattern was wrong, the space should preceed the \n
>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>>>   2.  Perhaps in the data you have other (non printing) characters than
>>>> \n?
>>>>
>>>>
>>>>
>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> Windows 10
>>>>
>>>>
>>>>
>>>> From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> Sent: Thursday, 7 February 2019 15:23
>>>> To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>>
>>>>
>>>>
>>>> Hi Paul,
>>>>
>>>> We have tried this suggested regex pattern as follow:
>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>>>    <str name="fieldName">content</str>
>>>>    <str name="pattern">(\n\s*){2,}</str>
>>>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> </processor>
>>>>
>>>> But we still have exactly the same problem of Example 1,2 and 3 below.
>>>>
>>>> Example 1: The sentence that the above regex pattern is working
>>>> correctly
>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>>>
>>>> Example 2: The sentence that the above regex pattern is partially
>>>> working
>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
>>>> Chu Kang Avenue 4, Singapore
>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
>>>> Chu Kang Avenue 4, Singapore
>>>>
>>>> Example 3: The sentence that the above regex pattern is partially
>>>> working
>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
>>>> \n\n
>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
>>>> 2018
>>>> at 10:07 AM
>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> <br><br>On
>>>> Tue, Dec 18, 2018 at 10:07 AM
>>>>
>>>> Any further suggestion?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:
>>>>
>>>> > To avoid the «\n+\s*» matching too many \n and then failing on the
>>>> {2,}
>>>> > part you could try
>>>> >
>>>> >
>>>> >
>>>> > <str name="pattern">(\n\s*){2,}</str>
>>>> >
>>>> >
>>>> >
>>>> > If you also want to match CRLF then
>>>> >
>>>> > <str name="pattern">(\r?\n\s*){2,}</str>
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > Windows 10
>>>> >
>>>> >
>>>> >
>>>> > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > Sent: Thursday, 7 February 2019 15:10
>>>> > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> >
>>>> >
>>>> >
>>>> > Hi Paul,
>>>> >
>>>> > Thanks for your reply.
>>>> >
>>>> > When I use this pattern:
>>>> > <processor class="solr.RegexReplaceProcessorFactory">
>>>> >    <str name="fieldName">content</str>
>>>> >    <str name="pattern">(\n+\s*){2,}</str>
>>>> >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > </processor>
>>>> >
>>>> > It is working for some sentence within the same content and not
>>>> working for
>>>> > some sentences. Please see below for the one that is working and
>>>> another
>>>> > that is not working (partially working):
>>>> >
>>>> > Example 1: The sentence that the above regex pattern is working
>>>> correctly
>>>> > *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>>> > *Index content: *    Dear Sir,  <br><br>I am terminating
>>>> >
>>>> > Example 2: The sentence that the above regex pattern is partially
>>>> working
>>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3
>>>> Choa
>>>> > Chu Kang Avenue 4, Singapore
>>>> > *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3
>>>> Choa
>>>> > Chu Kang Avenue 4, Singapore
>>>> >
>>>> > Example 3: The sentence that the above regex pattern is partially
>>>> working
>>>> > (as you can see, instead of 2 <br>, there are 4 <br>)
>>>> > *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n
>>>> \n
>>>> > \n\n
>>>> > \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec
>>>> 18, 2018
>>>> > at 10:07 AM
>>>> > *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>
>>>> <br><br>On
>>>> > Tue, Dec 18, 2018 at 10:07 AM
>>>> >
>>>> > We would appreciate your help to see what is wrong?
>>>> >
>>>> > Thank you.
>>>> >
>>>> > Regards,
>>>> > Edwin
>>>> >
>>>> > On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:
>>>> >
>>>> > > You don’t say what happens, just that it is not working. I assume
>>>> nothing
>>>> > > is replaced? Perhaps the pattern should be
>>>> > >
>>>> > >
>>>> > >
>>>> > >    <str name="pattern">"(\n\s*){2,}"</str>
>>>> > >
>>>> > >
>>>> > >
>>>> > > ??
>>>> > >
>>>> > >
>>>> > >
>>>> > > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> > > Windows 10
>>>> > >
>>>> > >
>>>> > >
>>>> > > From: Zheng Lin Edwin Yeo<ma...@gmail.com>
>>>> > > Sent: Thursday, 7 February 2019 14:08
>>>> > > To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
>>>> > > Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>> > >
>>>> > >
>>>> > >
>>>> > > Hi,
>>>> > >
>>>> > > I am trying to use the RegexReplaceProcessorFactory to remove more
>>>> than
>>>> > two
>>>> > > \n with any number of spaces between them (Eg: \n\n, \n \n, \n \n
>>>> \n
>>>> > \n),
>>>> > > and replace it with two <br>.
>>>> > >
>>>> > > I use the following regex pattern and it is working when I test it
>>>> in
>>>> > > regex101.com. But it is not working when I put it inside the
>>>> > > RegexReplaceProcessorFactory as below:
>>>> > >
>>>> > > <updateRequestProcessorChain name="removeCode">
>>>> > > <processor class="solr.RegexReplaceProcessorFactory">
>>>> > >    <str name="fieldName">content</str>
>>>> > >    <str name="pattern">"(\\n\s*){2,}"</str>
>>>> > >    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>> > > </processor>
>>>> > >           </updateRequestProcessorChain>
>>>> > >
>>>> > > To explain further about my regex pattern, \s* is instructing the
>>>> regex
>>>> > to
>>>> > > match any \n that have space after and {2,} is instructing the
>>>> regex to
>>>> > > match 2 or more occurrence of such pattern (\n).
>>>> > >
>>>> > > Please kindly let me know what is wrong and how should I do it?
>>>> > >
>>>> > > I am using Solr 7.6.0.
>>>> > >
>>>> > > Regards,
>>>> > > Edwin
>>>> > >
>>>> >
>>>>
>>>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi,

Should we report this as a bug in Solr?
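
Before filing anything, one quick check is to run the same pattern against Example 2 in plain Java, outside Solr, and compare the result with what was indexed. A minimal sketch follows; the class name is made up, and the input string is a transcription of Example 2 that assumes every blank is an ordinary ASCII space, which may not be true of the real document.

```java
import java.util.regex.Pattern;

final class Example2Check {
    // The pattern under discussion: (\n\s*){2,}
    static final Pattern NEWLINE_RUN = Pattern.compile("(\\n\\s*){2,}");

    static String replace(String text) {
        return NEWLINE_RUN.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from this thread, typed with plain spaces (assumption)
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  "
                + "3 Choa Chu Kang Avenue 4, Singapore";
        System.out.println(replace(example2));
    }
}
```

If the JDK on its own produces a single <br><br> between "Psalm 89:17" and "3 Choa Chu Kang" for this transcription while Solr indexes two pairs, that would point to the text reaching the processor being different from the transcription, rather than to the regex engine or the processor itself.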

Regards,
Edwin

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on
https://regex101.com/, it gives the correct result for all the examples
(i.e. each of them ends up with only one <br><br>, not more, unlike what
we are getting in Solr in our earlier examples).

Could this be a bug in Solr?

Regards,
Edwin
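
One way to rule the JDK in or out is to run the same pattern outside Solr.
The sketch below is plain Java, not Solr code: the class and method names
are hypothetical, and it assumes that the processor applies ordinary
java.util.regex replace-all semantics and that the gaps in Example 2 are
plain ASCII spaces.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of Solr.
class NewlineRunCheck {
    // The pattern from the thread: a newline plus trailing whitespace, two or more times.
    static final Pattern RUNS = Pattern.compile("(\\n\\s*){2,}");

    // Replaces every run matched by RUNS with exactly two <br> tags.
    static String collapse(String content) {
        return RUNS.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, assuming the gaps are plain ASCII spaces.
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa";
        // prints: exalted  <br><br>Psalm 89:17   <br><br>3 Choa
        System.out.println(collapse(example2));
    }
}
```

With plain spaces the run after "Psalm 89:17" collapses into a single
<br><br>, so the second pair seen in the index would more likely come from
the data or the indexing path than from the pattern itself.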

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

We have tried it with the whitespace preceding the \n, i.e.
<str name="pattern">(\s*\n){2,}</str>, using the following configuration:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\s*\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

However, we are also getting the exact same results as the earlier Example
1, 2 and 3.
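
For comparison, in plain Java the two patterns tried so far do not place
the surrounding spaces in the same way. The sketch below (hypothetical
class and method names, assuming Example 1 contains only ASCII spaces and
\n) shows the difference. If the index content really is identical after
switching patterns, it may be worth checking that the edited chain is the
one being applied; that is only a guess, nothing in this thread confirms
it.

```java
import java.util.regex.Pattern;

// Hypothetical comparison helper; not part of Solr.
class PatternOrderComparison {
    // Newline first, then optional whitespace (the earlier suggestion).
    static final Pattern NEWLINE_FIRST = Pattern.compile("(\\n\\s*){2,}");
    // Optional whitespace first, then newline (the later suggestion).
    static final Pattern SPACE_FIRST = Pattern.compile("(\\s*\\n){2,}");

    static String collapse(Pattern p, String content) {
        return p.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 1 from the thread, assuming plain ASCII spaces.
        String example1 = "Dear Sir,  \n\n \n \n\n I am terminating";
        // prints: Dear Sir,  <br><br>I am terminating
        System.out.println(collapse(NEWLINE_FIRST, example1));
        // prints: Dear Sir,<br><br> I am terminating
        System.out.println(collapse(SPACE_FIRST, example1));
    }
}
```

The first pattern leaves the spaces before the run and swallows those
after it; the second does the opposite.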

As for your point 2, that the data may contain other (non-printing)
characters than \n: we have found no non-printing characters. It is just a
line break followed by a space. You can refer to the original content in
the examples below.


Example 1: A sentence on which the above regex pattern works correctly
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: A sentence on which the above regex pattern works only partially
(as you can see, there are 4 <br> instead of 2)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence on which the above regex pattern works only partially
(as you can see, there are 4 <br> instead of 2)
*Original content in EML file:*

http://www.concordpri.moe.edu.sg/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM


Appreciate any other ideas or suggestions that you may have.

Thank you.

Regards,
Edwin

AW: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
Hi Edwin



  1.  Sorry, the pattern was wrong: the whitespace should precede the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
  2.  Perhaps in the data you have other (non printing) characters than \n?
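
A possible way to check point 2 is to list every blank-looking or control
character in the content that is neither \n nor a plain space. The sketch
below is plain Java with hypothetical names, and the sample string is made
up. It also illustrates one candidate: Java's default \s does not match
the no-break space U+00A0, so a run containing one would be split into two
matches. Whether the real data contains such a character is an assumption
that the probe is meant to test.

```java
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of Solr.
class WhitespaceProbe {
    // Reports blank-looking or control characters other than '\n' and ' ' as code points.
    static String describeSeparators(String content) {
        StringBuilder found = new StringBuilder();
        content.codePoints()
               .filter(cp -> cp != '\n' && cp != ' ')
               .filter(cp -> Character.isWhitespace(cp)
                          || Character.isSpaceChar(cp)
                          || Character.isISOControl(cp))
               .forEach(cp -> found.append(String.format("U+%04X ", cp)));
        return found.toString().trim();
    }

    // The pattern from the thread; by default \s covers only [ \t\n\x0B\f\r].
    static final Pattern STRICT = Pattern.compile("(\\n\\s*){2,}");
    // Widened variant that also accepts U+00A0 (an assumption about the data).
    static final Pattern WIDENED = Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    static String collapse(Pattern p, String content) {
        return p.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String sample = "a\n\n\u00A0\n\nb"; // made-up input with one no-break space
        System.out.println(describeSeparators(sample));                            // U+00A0
        System.out.println(collapse(STRICT, sample).replace("\u00A0", "[NBSP]"));  // a<br><br>[NBSP]<br><br>b
        System.out.println(collapse(WIDENED, sample));                             // a<br><br>b
    }
}
```

If the probe reports nothing for the real content, this explanation can be
dropped and the cause lies elsewhere.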



Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

We have tried the suggested regex pattern as follows:
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

But we still have exactly the same problem with Examples 1, 2 and 3 below.

Example 1: A sentence on which the above regex pattern works correctly
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: A sentence where the above regex pattern only partially works
(as you can see, there are 4 <br> instead of 2)
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence where the above regex pattern only partially works
(as you can see, there are 4 <br> instead of 2)
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM

Any further suggestions?

Thank you.

Regards,
Edwin
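
One way to narrow this down is to replay Example 2 outside Solr with
java.util.regex, the engine Solr uses. The sketch below is not from the
thread: the class name and the test strings are invented, and the no-break
space (U+00A0) is only a guess at what might sit between the newline pairs in
the real field. Under that assumption, the plain-space version collapses into
a single <br><br>, while a character that \s does not match by default splits
the run in two, much like the indexed output above.

```java
import java.util.regex.Pattern;

public class NewlineRunCheck {
    // Same regex text as in the <str name="pattern"> element above.
    public static final Pattern RUNS = Pattern.compile("(\\n\\s*){2,}");
    // Same regex with Java's UNICODE_CHARACTER_CLASS flag, so that \s also
    // covers Unicode white space such as the no-break space.
    public static final Pattern UNICODE_RUNS = Pattern.compile("(?U)(\\n\\s*){2,}");

    public static String collapse(String text) {
        return RUNS.matcher(text).replaceAll("<br><br>");
    }

    public static String collapseUnicode(String text) {
        return UNICODE_RUNS.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 retyped with ordinary ASCII spaces between the newlines.
        String ascii = "Psalm 89:17   \n\n   \n\n  3 Choa";
        // Hypothetical variant: no-break spaces between the two newline pairs.
        String nbsp = "Psalm 89:17   \n\n\u00A0\u00A0\u00A0\n\n  3 Choa";

        // The whole run is one match, so only one <br><br> is produced.
        System.out.println(collapse(ascii));
        // The run is split at the no-break spaces, so two <br><br> are produced.
        System.out.println(collapse(nbsp));
        // With the Unicode flag the run is a single match again.
        System.out.println(collapseUnicode(nbsp));
    }
}
```

If the real content does contain such characters, the (?U) flag used in the
second pattern may be worth trying in the pattern element. Whether that is
what is happening with this data has not been checked here.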

On Thu, 7 Feb 2019 at 22:20, <pa...@ub.unibe.ch> wrote:

> To avoid the «\n+\s*» matching too many \n and then failing on the {2,}
> part you could try
>
>
>
> <str name="pattern">(\n\s*){2,}</str>
>
>
>
> If you also want to match CRLF then
>
> <str name="pattern">(\r?\n\s*){2,}</str>

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part you could try



<str name="pattern">(\n\s*){2,}</str>



If you also want to match CRLF then

<str name="pattern">(\r?\n\s*){2,}</str>
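
To see what the CRLF variant changes, here is a minimal java.util.regex
sketch of the two patterns above on LF and CRLF input. The class name and the
sample strings are made up for illustration; only the two patterns come from
this message. Since \s also matches \r, the LF-only pattern still finds a
match in CRLF text, but it leaves the first \r behind.

```java
import java.util.regex.Pattern;

public class CrlfPatternCheck {
    public static final Pattern LF_ONLY = Pattern.compile("(\\n\\s*){2,}");
    public static final Pattern CRLF_AWARE = Pattern.compile("(\\r?\\n\\s*){2,}");

    public static String replace(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    // Makes carriage returns and newlines visible when printing.
    public static String visible(String text) {
        return text.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String lf = "Dear Sir,\n\n \n \n\nI am terminating";
        String crlf = "Dear Sir,\r\n\r\nI am terminating";

        // Both patterns treat LF-only input the same way.
        System.out.println(visible(replace(LF_ONLY, lf)));      // Dear Sir,<br><br>I am terminating
        System.out.println(visible(replace(CRLF_AWARE, lf)));   // Dear Sir,<br><br>I am terminating
        // On CRLF input the LF-only pattern starts matching after the first \r.
        System.out.println(visible(replace(LF_ONLY, crlf)));    // Dear Sir,\r<br><br>I am terminating
        System.out.println(visible(replace(CRLF_AWARE, crlf))); // Dear Sir,<br><br>I am terminating
    }
}
```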





From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Thursday, 7 February 2019 15:10
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Paul,

Thanks for your reply.

When I use this pattern:
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n+\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

It works for some sentences within the same content but not for others.
Please see below for one sentence that works and two that only partially
work:

Example 1: A sentence where the above regex pattern works correctly
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: A sentence where the above regex pattern only partially works
(as you can see, there are 4 <br> instead of 2)
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A sentence where the above regex pattern only partially works
(as you can see, there are 4 <br> instead of 2)
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM

We would appreciate your help in finding out what is wrong.

Thank you.

Regards,
Edwin
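
For comparison, the sketch below runs this pattern and the simpler
(\n\s*){2,} over Examples 1 and 2 with java.util.regex. The strings are
retyped from the examples above with real newline characters and ordinary
spaces, which is an assumption about the actual field content; the class and
method names are made up. Because \s also matches \n and the engine
backtracks, both patterns end up matching the same whitespace runs on input
like this.

```java
import java.util.regex.Pattern;

public class PlusVariantCheck {
    public static final Pattern WITH_PLUS = Pattern.compile("(\\n+\\s*){2,}");
    public static final Pattern WITHOUT_PLUS = Pattern.compile("(\\n\\s*){2,}");

    public static String replace(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        String[] samples = {
            "Dear Sir,  \n\n \n \n\n I am terminating",
            "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa"
        };
        for (String sample : samples) {
            String withPlus = replace(WITH_PLUS, sample);
            String withoutPlus = replace(WITHOUT_PLUS, sample);
            // Print the result and whether the two patterns agree on it.
            System.out.println(withPlus + " | same result: " + withPlus.equals(withoutPlus));
        }
    }
}
```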

On Thu, 7 Feb 2019 at 21:24, <pa...@ub.unibe.ch> wrote:

> You don’t say what happens, just that it is not working. I assume nothing
> is replaced? Perhaps the pattern should be
>
>
>
>    <str name="pattern">"(\n\s*){2,}"</str>
>
>
>
> ??

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Posted by pa...@ub.unibe.ch.
You don’t say what happens, just that it is not working. I assume nothing is replaced? Perhaps the pattern should be



   <str name="pattern">"(\n\s*){2,}"</str>



??
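
As far as I understand, Solr hands the text of the pattern element to
java.util.regex as it is written: XML does not unescape backslashes, and any
quote characters become part of the regex. Under that assumption, this sketch
(class name and sample text invented) compares the value from the original
post, the quoted suggestion above, and the same pattern without quotes.

```java
import java.util.regex.Pattern;

public class PatternTextCheck {
    // Returns true if the regex finds a match anywhere in the text.
    public static boolean finds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Sample field content with real newline characters.
        String text = "line one\n \nline two";
        // Sample made of literal characters: quote, backslash, n, backslash, n, quote.
        String literal = "\"\\n\\n\"";

        // Regex text as in the original post: "(\\n\s*){2,}" including the quotes.
        String original = "\"(\\\\n\\s*){2,}\"";
        // Regex text as suggested above: "(\n\s*){2,}" including the quotes.
        String quoted = "\"(\\n\\s*){2,}\"";
        // Regex text without the quotes: (\n\s*){2,}
        String bare = "(\\n\\s*){2,}";

        System.out.println(finds(original, text));    // false: needs quotes and a literal backslash
        System.out.println(finds(original, literal)); // true: matches the literal characters
        System.out.println(finds(quoted, text));      // false: needs quotes around the newlines
        System.out.println(finds(bare, text));        // true: matches the newline run
    }
}
```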



From: Zheng Lin Edwin Yeo<ma...@gmail.com>
Sent: Thursday, 7 February 2019 14:08
To: solr-user@lucene.apache.org<ma...@lucene.apache.org>
Subject: RegexReplaceProcessorFactory pattern to detect multiple \n