Posted to dev@opennlp.apache.org by Alan Wang <wp...@163.com> on 2021/01/20 02:52:25 UTC
Rule based sentence detector
Hi all,
I created a rule-based sentence detector for OpenNLP <https://issues.apache.org/jira/browse/OPENNLP-912>.
There are two kinds of rules:
1. break rules: specify where a sentence break occurs
2. no-break rules: disallow a sentence break
All rules have two parts:
1. Before the break
2. After the break
The idea of the algorithm:
1. Retrieve the break rules and find the locations where they match.
2. If no no-break rule matches at a break location, the text is split there and a new segment is created.
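The rule structure and algorithm described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the code from the attachment and not the OpenNLP API: the `Rule` class, the `split_sentences` function, and the example regexes are all hypothetical, and it assumes that the two parts of a rule are regular expressions matched immediately before and immediately after a candidate break position.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: two regexes around a candidate break position."""
    before: str  # must match the text that ends at the position
    after: str   # must match the text that starts at the position

    def matches_at(self, text: str, pos: int) -> bool:
        # Anchor "before" to the end of text[:pos] and "after" to the start of text[pos:].
        ends_before = re.search(rf"(?:{self.before})\Z", text[:pos]) is not None
        starts_after = re.match(self.after, text[pos:]) is not None
        return ends_before and starts_after


# Example rules (assumptions made for this sketch, not the rules of the proposal).
BREAK_RULES = [
    Rule(before=r"[.!?]", after=r"\s+"),  # terminal punctuation followed by whitespace
]
NO_BREAK_RULES = [
    Rule(before=r"\b(?:Mr|Mrs|Dr)\.", after=r"\s+"),  # common title abbreviations
    Rule(before=r"\.", after=r"\s+[a-z]"),            # period followed by a lowercase word
]


def split_sentences(text, break_rules=BREAK_RULES, no_break_rules=NO_BREAK_RULES):
    """Split where a break rule matches and no no-break rule matches."""
    segments = []
    start = 0
    for pos in range(1, len(text)):
        if not any(rule.matches_at(text, pos) for rule in break_rules):
            continue  # not a candidate break location
        if any(rule.matches_at(text, pos) for rule in no_break_rules):
            continue  # a no-break rule vetoes this location
        segments.append(text[start:pos].strip())  # close the current segment
        start = pos
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived at 5 p.m. today. He was late!"):
        print(sentence)
```

In this sketch the no-break rules act purely as vetoes on locations proposed by the break rules, so supporting another language would mean supplying different rule lists rather than changing the loop.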
Features:
1. Text cleanup and preprocessing
2. Easy to extend to other languages
Reference:
This library uses the "Golden Rules" tests of pragmatic_segmenter <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>.
Currently, 92.31% of the test cases pass. The following test cases fail: 39, 50, 52, 53.
For details, see the attachment.
Re:Re: Rule based sentence detector
Posted by Alan Wang <wp...@163.com>.
Hi all,
I have created a PR in the GitHub repo: https://github.com/apache/opennlp/pull/390. If anyone has comments, please feel free to share them.
Thanks
Alan
At 2021-01-22 02:59:40, "William Colen" <co...@apache.org> wrote:
>Hi Alan,
>
>Do you have a PR for the implementation?
>
>Thank you,
>William
Re: Rule based sentence detector
Posted by William Colen <co...@apache.org>.
Hi Alan,
Do you have a PR for the implementation?
Thank you,
William
On Tue, Jan 19, 2021 at 23:52, Alan Wang <wp...@163.com> wrote:
> Hi all,
>
> I created a rule-based sentence detector for OpenNLP
> <https://issues.apache.org/jira/browse/OPENNLP-912>.
> There are two kinds of rules:
>
> 1. break rules: specify where a sentence break occurs
> 2. no-break rules: disallow a sentence break
>
> All rules have two parts:
>
> 1. Before the break
> 2. After the break
>
> The idea of the algorithm:
>
> 1. Retrieve the break rules and find the locations where they match.
> 2. If no no-break rule matches at a break location, the text
> is split there and a new segment is created.
>
> Features:
>
> 1. Text cleanup and preprocessing
> 2. Easy to extend to other languages
>
> Reference:
>
> This library uses the "Golden Rules" tests of pragmatic_segmenter
> <https://github.com/diasks2/pragmatic_segmenter#the-golden-rules>.
>
> Currently, 92.31% of the test cases pass. The following test cases
> fail: 39, 50, 52, 53.
> For details, see the attachment.