
Posted to dev@opennlp.apache.org by Alan Wang <wp...@163.com> on 2021/01/20 02:52:25 UTC

Rule based sentence detector

Hi all,


I created a rule-based sentence detector for OpenNLP
<https://issues.apache.org/jira/browse/OPENNLP-912>.

There are two kinds of rules:

1. Break rules: specify where a sentence break may occur.
2. No-break rules: disallow a sentence break.

Every rule has two parts:

- Before the break
- After the break

The algorithm idea:

1. Retrieve the break rules and locate the candidate break positions they match.
2. If none of the no-break rules matches at a break location, mark the text as split there and create a new segment.
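
To make the idea concrete, here is a minimal sketch of the break / no-break check in plain Java. It is not the code from the PR and does not use any OpenNLP API: the class name RuleBasedSplitterSketch, the Rule type, and the three sample regex rules are all hypothetical and chosen only for illustration. It also tests every character position, which is simple but not efficient.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and rules are illustrative assumptions.
class RuleBasedSplitterSketch {

  /** A rule has two parts: what precedes the break and what follows it. */
  static final class Rule {
    final Pattern before; // must match text ending exactly at the position
    final Pattern after;  // must match text starting exactly at the position

    Rule(String before, String after) {
      // \z anchors the "before" part to the end of the prefix.
      this.before = Pattern.compile("(?:" + before + ")\\z");
      this.after = Pattern.compile(after);
    }

    boolean matchesAt(String text, int pos) {
      return before.matcher(text.substring(0, pos)).find()
          && after.matcher(text.substring(pos)).lookingAt();
    }
  }

  // Sample rules (assumptions, not taken from the PR).
  static final List<Rule> BREAK_RULES = Arrays.asList(
      new Rule("[.!?]", "\\s+\\p{Lu}"));
  static final List<Rule> NO_BREAK_RULES = Arrays.asList(
      new Rule("\\b(?:Mr|Mrs|Dr)\\.", "\\s"));

  static boolean anyMatch(List<Rule> rules, String text, int pos) {
    for (Rule r : rules) {
      if (r.matchesAt(text, pos)) {
        return true;
      }
    }
    return false;
  }

  /** Splits where a break rule matches and no no-break rule matches. */
  static List<String> split(String text, List<Rule> breakRules,
                            List<Rule> noBreakRules) {
    List<String> segments = new ArrayList<>();
    int start = 0;
    for (int pos = 1; pos < text.length(); pos++) {
      if (anyMatch(breakRules, text, pos)
          && !anyMatch(noBreakRules, text, pos)) {
        segments.add(text.substring(start, pos).trim());
        start = pos;
      }
    }
    String tail = text.substring(start).trim();
    if (!tail.isEmpty()) {
      segments.add(tail); // whatever remains is the last segment
    }
    return segments;
  }

  public static void main(String[] args) {
    // Prints: [Dr. Smith arrived., He sat down.]
    System.out.println(split("Dr. Smith arrived. He sat down.",
        BREAK_RULES, NO_BREAK_RULES));
  }
}
```

In this sketch the period after "Dr" is a candidate break, but the no-break rule matches at the same position, so no segment is created there.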

Features:

- Text cleanup and preprocessing
- Easy to extend to other languages

Reference:

This library uses the "Golden Rules" tests of pragmatic_segmenter
<https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>.

Currently, the pass rate of the test cases is 92.31%. The following test cases fail: 39, 50, 52, 53.
For details, see the attachment.

Re:Re: Rule based sentence detector

Posted by Alan Wang <wp...@163.com>.



Hi all,

I have created a PR in the GitHub repo: https://github.com/apache/opennlp/pull/390. If anyone has comments, please feel free to share them.

Thanks,
Alan

At 2021-01-22 02:59:40, "William Colen" <co...@apache.org> wrote:
>Hi Alan,
>
>Do you have a PR for the implementation?
>
>Thank you,
>William
>
>On Tue, Jan 19, 2021 at 23:52, Alan Wang <wp...@163.com> wrote:
>
>> Hi all,
>>
>> I created a rule based sentence detector for OpenNLP
>> <https://issues.apache.org/jira/browse/OPENNLP-912>.
>> There are two kinds of rules:
>>
>> 1. break rules: specifying the sentence break
>> 2. no-break rules: disallowing the sentence break
>>
>> All rules have two parts:
>>
>> Before the break
>> After the break
>>
>> The algorithm idea:
>>
>> Retrieves the break rules.
>> If none of the no-break rules is matched at the break location, the text
>> is marked as split and a new segment is created
>>
>> Features:
>>
>> Text Cleanup and Preprocessing
>> Easy to extend other languages
>>
>> Reference:
>>
>> This library use "Golden Rule" test of pragmatic_segmenter
>> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>
>>
>> Currently, the pass rate of test cases is 92.31%. The following test cases
>> fail: 39, 50, 53, 52
>> For details, see the attachment.
>>

Re: Rule based sentence detector

Posted by William Colen <co...@apache.org>.

Hi Alan,

Do you have a PR for the implementation?

Thank you,
William
