Posted to dev@opennlp.apache.org by Alan Wang <wp...@163.com> on 2021/02/04 01:22:34 UTC

Re:Re: Rule based sentence detector



Hi all, I have created a PR in the GitHub repo: https://github.com/apache/opennlp/pull/390. If anyone has comments, please feel free to share them.




Thanks
Alan








At 2021-01-22 02:59:40, "William Colen" <co...@apache.org> wrote:
>Hi Alan,
>
>Do you have a PR for the implementation?
>
>Thank you,
>William
>
>On Tue, Jan 19, 2021 at 23:52, Alan Wang <wp...@163.com> wrote:
>
>> Hi all,
>>
>> I created a rule-based sentence detector for OpenNLP
>> <https://issues.apache.org/jira/browse/OPENNLP-912>.
>> There are two kinds of rules:
>>
>> 1. break rules: specifying the sentence break
>> 2. no-break rules: disallowing the sentence break
>>
>> All rules have two parts:
>>
>> Before the break
>> After the break
>>
>> The algorithm idea:
>>
>> For each location where a break rule matches, check the no-break rules.
>> If none of the no-break rules matches at that location, the text
>> is split there and a new segment is created.
>>
>> Features:
>>
>> Text Cleanup and Preprocessing
>> Easy to extend to other languages
>>
>> Reference:
>>
>> This library uses the "Golden Rules" tests of pragmatic_segmenter
>> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>
>>
>> Currently, the pass rate of test cases is 92.31%. The following test cases
>> fail: 39, 50, 53, 52
>> For details, see the attachment.
>>
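The break / no-break idea described in the quoted message can be illustrated with a minimal Java sketch. This is not the code from the PR: the class name `RuleBasedSplitSketch`, the `Rule` type, and the three example rules are all hypothetical and chosen only for illustration. Each rule has a "before the break" part and an "after the break" part, and a position becomes a sentence boundary only if some break rule matches there and no no-break rule does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and rules are assumptions, not the PR's API.
class RuleBasedSplitSketch {

    /** A rule with two parts: text before the break and text after the break. */
    static final class Rule {
        final Pattern before; // must match text ending exactly at the position
        final Pattern after;  // must match text starting exactly at the position

        Rule(String beforeRegex, String afterRegex) {
            this.before = Pattern.compile("(?:" + beforeRegex + ")\\z");
            this.after = Pattern.compile(afterRegex);
        }

        boolean matchesAt(String text, int pos) {
            // Substrings keep the sketch simple; a real implementation
            // would likely avoid copying the text at every position.
            return before.matcher(text.substring(0, pos)).find()
                && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    // Illustrative rules only (not taken from the Golden Rules).
    public static final List<Rule> BREAK_RULES = Arrays.asList(
        new Rule("[.!?]", "\\s+\\p{Lu}"));          // end mark, then an uppercase word
    public static final List<Rule> NO_BREAK_RULES = Arrays.asList(
        new Rule("\\b(?:Dr|Mr|Mrs)\\.", "\\s"));    // common title abbreviations

    static boolean matchesAny(List<Rule> rules, String text, int pos) {
        for (Rule r : rules) {
            if (r.matchesAt(text, pos)) {
                return true;
            }
        }
        return false;
    }

    /** Splits where a break rule matches and no no-break rule matches. */
    public static List<String> split(String text, List<Rule> breakRules,
                                     List<Rule> noBreakRules) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            if (matchesAny(breakRules, text, pos)
                    && !matchesAny(noBreakRules, text, pos)) {
                String s = text.substring(start, pos).trim();
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
                start = pos; // a new segment begins at the break
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived. He was late! Was he?";
        // Prints: [Dr. Smith arrived., He was late!, Was he?]
        System.out.println(split(text, BREAK_RULES, NO_BREAK_RULES));
    }
}
```

In this sketch the period after "Dr" satisfies the break rule, but the no-break rule for title abbreviations also matches at that position, so no split happens there. Supporting another language would then amount to supplying a different pair of rule lists.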