Posted to users@opennlp.apache.org by Riccardo Tasso <ri...@gmail.com> on 2013/03/25 16:31:04 UTC

Speech Detection

Hi, I'm trying to use the OpenNLP SentenceDetector to split Italian sentences
(without abbreviations) which represent speech.

I have a fairly big data set, annotated by human experts, in which each
document is a line of text segmented into one or more pieces depending on
our needs.

To better understand my case, if the line is the following:
I'm not able to play tennis - he said - You're right - replied his wife

The right segmentation should be:
I'm not able to play tennis
 - he said -
You're right
 - replied his wife

I decided to try a statistical approach to segment my text, and the
SentenceDetector seems to be the right choice to me.

I've built the training set in the format specified in
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
which is (a short example follows the list):

   - one segment per line
   - a blank line to separate two documents
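
For instance, a training file containing the line above plus a second,
invented document would look like this (the blank line marks the document
boundary):

I'm not able to play tennis
 - he said -
You're right
 - replied his wife

He went home
 - she explained -
and never came back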

To evaluate performance I've divided my data set into one part for training
and one for validation, but the results were quite low:
Precision: 0.4485549132947977
Recall: 0.3038371182458888
F-Measure: 0.3622782446311859

Since I've used the default values I guess there should be some way to obtain
better results... or maybe I need a different kind of model?

Thanks,
   Riccardo

Re: Speech Detection

Posted by Riccardo Tasso <ri...@gmail.com>.
Thank you.

I've also decided to try a simpler rule-based approach, and it performs
quite well. In any case, this discussion was very useful to me.
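
For the record, the rule-based approach is essentially a regular expression
over the dash-delimited inserts. A rough sketch of the idea (simplified, and
not my exact code) is:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpeechSplitter {

    // " - he said - " style inserts: a dash surrounded by spaces opens the
    // insert, a second dash (or the end of the line) closes it.
    private static final Pattern INSERT =
            Pattern.compile("\\s-\\s([^-]+?)(\\s-\\s?|$)");

    public static List<String> split(String line) {
        List<String> segments = new ArrayList<String>();
        Matcher m = INSERT.matcher(line);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                segments.add(line.substring(last, m.start()).trim());
            }
            // keep the dashes with the insert, as in the annotated data
            String closing = m.group(2).isEmpty() ? "" : " -";
            segments.add("- " + m.group(1).trim() + closing);
            last = m.end();
        }
        if (last < line.length()) {
            segments.add(line.substring(last).trim());
        }
        return segments;
    }

    public static void main(String[] args) {
        String line = "I'm not able to play tennis - he said -"
                + " You're right - replied his wife";
        for (String s : split(line)) {
            System.out.println(s);
        }
    }
}

On the example line it prints the four segments from my first mail, one per
line.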

Cheers,
   Riccardo


2013/3/29 William Colen <wi...@gmail.com>

> Looking again at your sample, I believe you won't be able to get good results
> with the standard learnable OpenNLP Sentence Detector, and maybe not with any
> other ready-to-use tool. Your segmentation relies on language knowledge that
> is hidden at this level of processing. You may have to combine sentence
> segmentation with POS tagging or clause categorization to get good results.
>
> On Tue, Mar 26, 2013 at 10:30 AM, Jörn Kottmann <ko...@gmail.com>
> wrote:
>
> > Hello,
> >
> > the sentence detector only considers EOS chars as potential
> > sentence boundaries, it should not be difficult to extend/modify it so
> > that locations detected by user code are used for the split decision.
> >
> > The iterations specify the maximum number of iterations for an iterative
> > machine learning algorithm, and cutoff removes features which did not
> > occur at least n times in the training data.
> >
> > Jörn
> >
> >
> > On 03/26/2013 01:52 PM, Riccardo Tasso wrote:
> >
> >> Thank you Jörn, in fact the results improved a lot:
> >> Precision: 0.5325131810193322
> >> Recall: 0.4745497259201253
> >> F-Measure: 0.5018633540372671
> >>
> >> I guess the splitter could have better results if it were able to detect
> >> parenthetic structure such as:
> >> some text - speech - other text
> >> which in my dataset is split as:
> >> some text
> >> - speech -
> >> other text
> >> Is it possible?
> >>
> >> Another useful optimization would be detecting end-of-sentence symbols
> >> longer than one character, for example "...".
> >>
> >> Can you tell me more about the following parameters?
> >>
> >>     - iterations
> >>     - cutoff
> >>
> >> Is there any guideline on how tune them?
> >>
> >> Cheers,
> >> Riccardo
> >>
> >>
> >>
> >> 2013/3/26 Jörn Kottmann <ko...@gmail.com>
> >>
> >>  On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
> >>>
> >>>> Is the Sentence Detector also able to split on non-dot characters? In my
> >>>> case there are also other characters delimiting the end of a segment,
> >>>> such as the colon (:), the dash (-), and various kinds of quotation
> >>>> marks (", `, ', ...).
> >>>>
> >>> The Sentence Detector can only split on end-of-sentence characters; by
> >>> default these are . ! ?, but with 1.5.3 you can set them during training
> >>> to your own custom set. There is a command line argument for it on the
> >>> Sentence Detector Trainer; have a look at the help.
> >>>
> >>> If you don't want to compile yourself use the 1.5.3 RC2 which we are
> >>> currently testing.
> >>>
> >>> Jörn
> >>>
> >>>
> >>>
> >>>
> >
>

Re: Speech Detection

Posted by William Colen <wi...@gmail.com>.
Looking again at your sample, I believe you won't be able to get good results
with the standard learnable OpenNLP Sentence Detector, and maybe not with any
other ready-to-use tool. Your segmentation relies on language knowledge that
is hidden at this level of processing. You may have to combine sentence
segmentation with POS tagging or clause categorization to get good results.

On Tue, Mar 26, 2013 at 10:30 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> Hello,
>
> the sentence detector only considers EOS chars as potential
> sentence boundaries, it should not be difficult to extend/modify it so
> that locations detected by user code are used for the split decision.
>
> The iterations specify the maximum number of iterations for an iterative
> machine learning algorithm, and cutoff removes features which did not
> occur at least n times in the training data.
>
> Jörn
>
>
> On 03/26/2013 01:52 PM, Riccardo Tasso wrote:
>
>> Thank you Jörn, in fact the results improved a lot:
>> Precision: 0.5325131810193322
>> Recall: 0.4745497259201253
>> F-Measure: 0.5018633540372671
>>
>> I guess the splitter could have better results if it were able to detect
>> parenthetic structure such as:
>> some text - speech - other text
>> which in my dataset is split as:
>> some text
>> - speech -
>> other text
>> Is it possible?
>>
>> Another optimization should be the one which could detect symbols to end a
>> sentence longer than one character, for example "...".
>>
>> Can you tell me more about the following parameters?
>>
>>     - iterations
>>     - cutoff
>>
>> Is there any guideline on how tune them?
>>
>> Cheers,
>> Riccardo
>>
>>
>>
>> 2013/3/26 Jörn Kottmann <ko...@gmail.com>
>>
>>  On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
>>>
>>>  Is the Sentence Detector able to split also on non dot characters? In my
>>>> case there should be also other characters delimiting the end of a
>>>> segment,
>>>> such as: colon (:), dash (-), various kind of quotation marks (", `, ',
>>>> ...).
>>>>
>>>>  The Sentence Detector can only split on end-of-sentence characters, by
>>> default these
>>> are . ! ? but with 1.5.3 you can set them during training to your custom
>>> set, there is
>>> a command line argument for it on the Sentence Detector Trainer, have a
>>> look at the help.
>>>
>>> If you don't want to compile yourself use the 1.5.3 RC2 which we are
>>> currently testing.
>>>
>>> Jörn
>>>
>>>
>>>
>>>
>

Re: Speech Detection

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

the sentence detector only considers EOS characters as potential
sentence boundaries; it should not be difficult to extend or modify it so
that locations detected by user code are used for the split decision.

The iterations parameter specifies the maximum number of iterations for the
iterative machine learning algorithm, and the cutoff removes features which
did not occur at least n times in the training data.
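
If you train via the API instead of the command line tool you can set both
through TrainingParameters, roughly like this (written from memory, so check
the Javadoc of your version for the exact train signature):

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class TrainWithParams {

    public static SentenceModel train(ObjectStream<SentenceSample> samples)
            throws java.io.IOException {
        // more iterations lets the trainer run longer,
        // a lower cutoff keeps features that occur less often
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "300");
        params.put(TrainingParameters.CUTOFF_PARAM, "2");

        // "it" = language code, true = use token end, null = no abbreviation dictionary
        return SentenceDetectorME.train("it", samples, true, null, params);
    }
}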

Jörn

On 03/26/2013 01:52 PM, Riccardo Tasso wrote:
> Thank you Jörn, in fact the results improved a lot:
> Precision: 0.5325131810193322
> Recall: 0.4745497259201253
> F-Measure: 0.5018633540372671
>
> I guess the splitter could have better results if it were able to detect
> parenthetic structure such as:
> some text - speech - other text
> which in my dataset is split as:
> some text
> - speech -
> other text
> Is it possible?
>
> Another optimization should be the one which could detect symbols to end a
> sentence longer than one character, for example "...".
>
> Can you tell me more about the following parameters?
>
>     - iterations
>     - cutoff
>
> Is there any guideline on how tune them?
>
> Cheers,
> Riccardo
>
>
>
> 2013/3/26 Jörn Kottmann <ko...@gmail.com>
>
>> On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
>>
>>> Is the Sentence Detector able to split also on non dot characters? In my
>>> case there should be also other characters delimiting the end of a
>>> segment,
>>> such as: colon (:), dash (-), various kind of quotation marks (", `, ',
>>> ...).
>>>
>> The Sentence Detector can only split on end-of-sentence characters, by
>> default these
>> are . ! ? but with 1.5.3 you can set them during training to your custom
>> set, there is
>> a command line argument for it on the Sentence Detector Trainer, have a
>> look at the help.
>>
>> If you don't want to compile yourself use the 1.5.3 RC2 which we are
>> currently testing.
>>
>> Jörn
>>
>>
>>


Re: Speech Detection

Posted by William Colen <wi...@gmail.com>.
I should have mentioned that to use your factory you simply specify its
fully qualified class name in the command line tool's "-factory" argument.
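
Something along these lines (the class name is just an example, and please
double-check the other arguments against the tool's help output):

bin/opennlp SentenceDetectorTrainer -factory com.example.SpeechSentenceDetectorFactory \
    -lang it -encoding UTF-8 -data it-speech.train -model it-speech-sent.bin

Your factory class has to be on the classpath of the tool, of course.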


On Tue, Mar 26, 2013 at 10:21 AM, William Colen <wi...@gmail.com> wrote:

> Riccardo,
>
> You can tune your sentence detector using a custom context generator.
>
> At
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/java/opennlp/tools/sentdetect/
> take a look at DummySentenceDetectorFactory.java
> and SentenceDetectorFactoryTest.java
>
> If you prefer a concrete example, take a look at an implementation I did
> for another project:
>
> https://github.com/cogroo/cogroo4/tree/master/cogroo-nlp/src/main/java/org/cogroo/tools/sentdetect
>
> William
>
>
> On Tue, Mar 26, 2013 at 9:52 AM, Riccardo Tasso <ri...@gmail.com> wrote:
>
>> Thank you Jörn, in fact the results improved a lot:
>> Precision: 0.5325131810193322
>> Recall: 0.4745497259201253
>> F-Measure: 0.5018633540372671
>>
>> I guess the splitter could have better results if it were able to detect
>> parenthetic structure such as:
>> some text - speech - other text
>> which in my dataset is split as:
>> some text
>> - speech -
>> other text
>> Is it possible?
>>
>> Another optimization should be the one which could detect symbols to end a
>> sentence longer than one character, for example "...".
>>
>> Can you tell me more about the following parameters?
>>
>>    - iterations
>>    - cutoff
>>
>> Is there any guideline on how tune them?
>>
>> Cheers,
>> Riccardo
>>
>>
>>
>> 2013/3/26 Jörn Kottmann <ko...@gmail.com>
>>
>> > On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
>> >
>> >> Is the Sentence Detector able to split also on non-dot characters? In my
>> >> case there should be also other characters delimiting the end of a
>> >> segment,
>> >> such as: colon (:), dash (-), various kind of quotation marks (", `, ',
>> >> ...).
>> >>
>> >
>> > The Sentence Detector can only split on end-of-sentence characters, by
>> > default these
>> > are . ! ? but with 1.5.3 you can set them during training to your custom
>> > set, there is
>> > a command line argument for it on the Sentence Detector Trainer, have a
>> > look at the help.
>> >
>> > If you don't want to compile yourself use the 1.5.3 RC2 which we are
>> > currently testing.
>> >
>> > Jörn
>> >
>> >
>> >
>>
>
>

Re: Speech Detection

Posted by William Colen <wi...@gmail.com>.
Riccardo,

You can tune your sentence detector using a custom context generator.

At
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/java/opennlp/tools/sentdetect/
take a look at DummySentenceDetectorFactory.java
and SentenceDetectorFactoryTest.java

If you prefer a concrete example, take a look at an implementation I did
for another project:
https://github.com/cogroo/cogroo4/tree/master/cogroo-nlp/src/main/java/org/cogroo/tools/sentdetect
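
To give you an idea, a custom factory is roughly the following (method names
as I recall them from trunk; the test classes above are the authoritative
reference, and the EOS characters here are only an example):

package com.example;

import opennlp.tools.sentdetect.DefaultEndOfSentenceScanner;
import opennlp.tools.sentdetect.DefaultSDContextGenerator;
import opennlp.tools.sentdetect.EndOfSentenceScanner;
import opennlp.tools.sentdetect.SDContextGenerator;
import opennlp.tools.sentdetect.SentenceDetectorFactory;

public class SpeechSentenceDetectorFactory extends SentenceDetectorFactory {

    // treat dash and colon as potential segment boundaries in addition to . ! ?
    private static final char[] EOS = { '.', '!', '?', '-', ':' };

    // a no-argument constructor is needed so the factory can be
    // instantiated again when a model is loaded
    public SpeechSentenceDetectorFactory() {
        super();
    }

    @Override
    public EndOfSentenceScanner getEndOfSentenceScanner() {
        return new DefaultEndOfSentenceScanner(EOS);
    }

    @Override
    public SDContextGenerator getSDContextGenerator() {
        return new DefaultSDContextGenerator(EOS);
    }
}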

William


On Tue, Mar 26, 2013 at 9:52 AM, Riccardo Tasso <ri...@gmail.com> wrote:

> Thank you Jörn, in fact the results improved a lot:
> Precision: 0.5325131810193322
> Recall: 0.4745497259201253
> F-Measure: 0.5018633540372671
>
> I guess the splitter could have better results if it were able to detect
> parenthetic structure such as:
> some text - speech - other text
> which in my dataset is split as:
> some text
> - speech -
> other text
> Is it possible?
>
> Another optimization should be the one which could detect symbols to end a
> sentence longer than one character, for example "...".
>
> Can you tell me more about the following parameters?
>
>    - iterations
>    - cutoff
>
> Is there any guideline on how tune them?
>
> Cheers,
> Riccardo
>
>
>
> 2013/3/26 Jörn Kottmann <ko...@gmail.com>
>
> > On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
> >
> >> Is the Sentence Detector able to split also on non dot characters? In my
> >> case there should be also other characters delimiting the end of a
> >> segment,
> >> such as: colon (:), dash (-), various kind of quotation marks (", `, ',
> >> ...).
> >>
> >
> > The Sentence Detector can only split on end-of-sentence characters, by
> > default these
> > are . ! ? but with 1.5.3 you can set them during training to your custom
> > set, there is
> > a command line argument for it on the Sentence Detector Trainer, have a
> > look at the help.
> >
> > If you don't want to compile yourself use the 1.5.3 RC2 which we are
> > currently testing.
> >
> > Jörn
> >
> >
> >
>

Re: Speech Detection

Posted by Riccardo Tasso <ri...@gmail.com>.
Thank you Jörn, in fact the results improved a lot:
Precision: 0.5325131810193322
Recall: 0.4745497259201253
F-Measure: 0.5018633540372671

I guess the splitter could give better results if it were able to detect
parenthetic structures such as:
some text - speech - other text
which in my dataset is split as:
some text
- speech -
other text
Is it possible?

Another useful optimization would be detecting end-of-sentence symbols longer
than one character, for example "...".

Can you tell me more about the following parameters?

   - iterations
   - cutoff

Is there any guideline on how to tune them?

Cheers,
Riccardo



2013/3/26 Jörn Kottmann <ko...@gmail.com>

> On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
>
>> Is the Sentence Detector able to split also on non dot characters? In my
>> case there should be also other characters delimiting the end of a
>> segment,
>> such as: colon (:), dash (-), various kind of quotation marks (", `, ',
>> ...).
>>
>
> The Sentence Detector can only split on end-of-sentence characters, by
> default these
> are . ! ? but with 1.5.3 you can set them during training to your custom
> set, there is
> a command line argument for it on the Sentence Detector Trainer, have a
> look at the help.
>
> If you don't want to compile yourself use the 1.5.3 RC2 which we are
> currently testing.
>
> Jörn
>
>
>

Re: Speech Detection

Posted by Jörn Kottmann <ko...@gmail.com>.
On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
> Is the Sentence Detector also able to split on non-dot characters? In my
> case there are also other characters delimiting the end of a segment,
> such as the colon (:), the dash (-), and various kinds of quotation marks
> (", `, ', ...).

The Sentence Detector can only split on end-of-sentence characters; by
default these are . ! ?, but with 1.5.3 you can set them during training to
your own custom set. There is a command line argument for it on the Sentence
Detector Trainer; have a look at the help.
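
From memory the invocation looks something like this (the EOS characters are
just an example set, and the exact argument name is listed in the Trainer's
help output):

bin/opennlp SentenceDetectorTrainer -lang it -encoding UTF-8 \
    -data it-speech.train -model it-speech-sent.bin -eosChars ".!?-:"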

If you don't want to compile yourself use the 1.5.3 RC2 which we are 
currently testing.

Jörn



Re: Speech Detection

Posted by Riccardo Tasso <ri...@gmail.com>.
I have 1966 very short documents (as in the earlier example), which are
split into 2525 segments.
As you can guess, many documents should not be split at all.

For evaluation I've just split the data set into two equal parts.

Is the Sentence Detector also able to split on non-dot characters? In my
case there are also other characters delimiting the end of a segment, such
as the colon (:), the dash (-), and various kinds of quotation marks
(", `, ', ...).

The other trap is that I shouldn't split on every dot.

For example:
"Hello, my name is Riccardo. I've studied computer science in 2002 - he
said - and I finished in 2009." Then he began to type something on his
keyboard. It was binary code!

Should be segmented as:
"Hello, my name is Riccardo. I've studied computer science in 2002
- he said -
and I finished in 2009."
Then he began to type something on his keyboard. It was binary code!

Cheers,
   Riccardo


2013/3/26 James Kosin <ja...@gmail.com>

> On 3/25/2013 11:31 AM, Riccardo Tasso wrote:
>
>> Hi, I'm trying to use OpenNLP SentenceDetector to split Italian sentences
>> (without abbreviations) which represent speeches.
>>
>> I have a quite big data-set annotated by human experts in which each
>> document is a line of text, segmented in one or more pieces depending on
>> our needs.
>>
>> To better understand my case, if the line is the following:
>> I'm not able to play tennis - he said - You're right - replied his wife
>>
>> The right segmentation should be:
>> I'm not able to play tennis
>>   - he said -
>> You're right
>>   - replied his wife
>>
>> I decided to try a statistical approach to segment my text, and the
>> SentenceDetector seems to be the right choice to me.
>>
>> I've built the training set in the format specified in
>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
>> which is:
>>
>>     - one segment per line
>>     - a blank line to separate two documents
>>
>>
>> To evaluate performance I've divided my dataset in one for training and
>> one
>> for validation but the performance was quite low:
>> Precision: 0.4485549132947977
>> Recall: 0.3038371182458888
>> F-Measure: 0.3622782446311859
>>
>> Since I've used default values I guess there should be some way to obtain
>> better results...or maybe do I need another model?
>>
>> Thanks,
>>     Riccardo
>>
> Riccardo,
>
> How many sentences and documents are in your training set?
>
> James
>

Re: Speech Detection

Posted by James Kosin <ja...@gmail.com>.
On 3/25/2013 11:31 AM, Riccardo Tasso wrote:
> Hi, I'm trying to use OpenNLP SentenceDetector to split Italian sentences
> (without abbreviations) which represent speeches.
>
> I have a quite big data-set annotated by human experts in which each
> document is a line of text, segmented in one or more pieces depending on
> our needs.
>
> To better understand my case, if the line is the following:
> I'm not able to play tennis - he said - You're right - replied his wife
>
> The right segmentation should be:
> I'm not able to play tennis
>   - he said -
> You're right
>   - replied his wife
>
> I decided to try a statistical approach to segment my text, and the
> SentenceDetector seems to be the right choice to me.
>
> I've built the training set in the format specified in
> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
> which
> is:
>
>     - one segment per line
>     - a blank line to separate two documents
>
> To evaluate performance I've divided my dataset in one for training and one
> for validation but the performance was quite low:
> Precision: 0.4485549132947977
> Recall: 0.3038371182458888
> F-Measure: 0.3622782446311859
>
> Since I've used default values I guess there should be some way to obtain
> better results...or maybe do I need another model?
>
> Thanks,
>     Riccardo
>
Riccardo,

How many sentences and documents are in your training set?

James