Posted to users@opennlp.apache.org by Em <ma...@yahoo.de> on 2011/12/04 14:12:12 UTC

Re: How does good training data look like?

Hi Vyacheslav,

I started using OpenNLP in my free time. As promised, I am sharing the
outcomes of some tests - although I did not reach the 2,500 tagged
sentences due to time constraints.

Unfortunately I did not have the time to back up my findings with
numbers or statistics (I just played around with the data and OpenNLP).

However, I saw a correlation between the length of the training
documents and the length of the data you want to tag with OpenNLP.

I took several sub-passages of the articles in my training data (let's
call them the test set) and tried to tag them.
The result of my test was: if the training documents are on average
"long" and you test with short test sets, precision and recall are
relatively poor.
For example, I took 10 sentences from my training data and tried to tag
them, starting with just one sentence and working up to all 10.
Additionally, I rewrote the sentences so that their meaning stayed the
same for a human reader while the wording looked completely different
from the original sentences. The results were the same.
When OpenNLP detected anything in the one-sentence examples, it was
often only one of the two or three entities present. Some sentences
remained untagged although they contained entities.
However, when I combined several sentences, OpenNLP became much more
accurate (since together they formed a document length closer to the
training documents than the one-sentence examples did).
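
(In case someone wants to reproduce this: such a test can be run with
plain NameFinderME calls, roughly like the sketch below. OpenNLP 1.5.x
API; the model file name and the sentences are placeholders, not my
exact setup, and the whitespace tokenizer is used just to keep the
sketch short. clearAdaptiveData is called after each test document so
one document does not influence the next.)

  import java.io.FileInputStream;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.tokenize.WhitespaceTokenizer;
  import opennlp.tools.util.Span;

  public class TagTestDocument {

      public static void main(String[] args) throws Exception {
          // Placeholder model file; use whatever name finder model you trained.
          TokenNameFinderModel model =
              new TokenNameFinderModel(new FileInputStream("wiki-ner.bin"));
          NameFinderME finder = new NameFinderME(model);

          // One "test document": try it with a single sentence first,
          // then with all sentences combined.
          String[] testDocument = {
              "First sentence of the test passage .",
              "Second sentence of the test passage .",
              "Third sentence of the test passage ."
          };

          for (String sentence : testDocument) {
              String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
              Span[] names = finder.find(tokens);
              for (String name : Span.spansToStrings(names, tokens)) {
                  System.out.println(name);
              }
          }

          // Forget the document-level (adaptive) context before the next test document.
          finder.clearAdaptiveData();
      }
  }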

Training data:
My training data contained passages of wiki articles, with an average
length of several sentences per document. To me, a document was
everything that belongs to a specific headline in a Wikipedia article,
grouped by the most specific headline.
So if an H3 section contained several H4 headlines, the content
belonging to each H4 headline formed its own document.
As a consequence there were also several longer documents, since H4
headlines were rarely used.

I decided to create a document for every most-specific headline because
the wiki contained articles of every length - some could fill short
books of 10 or more pages, others had just one sentence per article.
Splitting on the most-specific headline made it possible to get a good
mix of document lengths.
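
(For anyone not familiar with it: name finder training data is one
sentence per line, whitespace-tokenized, with entities marked inline,
and an empty line between two documents; the empty line is what makes
the trainer clear the adaptive data. The sentences and the entity type
below are made up, just to show the layout.)

  <START:person> Anna Schmidt <END> founded the institute in 1899 .
  The institute was later led by <START:person> Karl Weber <END> .

  <START:person> Maria Kunze <END> became director in 1905 .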

I would be glad to hear your feedback!

Hope this helps,
Em


On 05.10.2011 23:50, Vyacheslav Zholudev wrote:
> Hi Em,
> 
> could you please share the outcome when you have some results? I would be interested to hear them.
> 
> Thanks,
> Vyacheslav 
> 
> On Oct 5, 2011, at 11:08 PM, Em wrote:
> 
>> Thanks Jörn!
>>
>> I'll experiment with this.
>>
>> Regards,
>> Em
>>
>> On 05.10.2011 19:47, Jörn Kottmann wrote:
>>> On 10/3/11 10:30 AM, Em wrote:
>>>> What about document length?
>>>> Just as an example: the production data will contain documents that
>>>> are several pages long as well as very short texts containing only a
>>>> few sentences.
>>>>
>>>> I am thinking about chunking the long documents into smaller ones
>>>> (i.e. a page of a longer document is split off into an individual
>>>> doc). Does this make sense?
>>>
>>> I would first try to process a long document at once. If you
>>> encounter any issues you could just call clearAdaptiveData before the
>>> end of the document.
>>> But as Olivier said, you might just want to include a couple of these
>>> in your training data.
>>>
>>> Jörn
>>>
> 
> Best,
> Vyacheslav
> 
> 
> 
> 

Re: How does good training data look like?

Posted by Jörn Kottmann <ko...@gmail.com>.
Our default cutoff of 5 doesn't really work when
you just have 200 sentences of training data.
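
(A rough sketch of how the cutoff can be lowered through the training
parameters, assuming the OpenNLP 1.5.x API; the training file name and
the entity type are placeholders:)

  import java.io.FileInputStream;
  import java.util.Collections;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.NameSample;
  import opennlp.tools.namefind.NameSampleDataStream;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;
  import opennlp.tools.util.TrainingParameters;
  import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

  public class TrainWithLowCutoff {

      public static void main(String[] args) throws Exception {
          // Same defaults as usual, except the cutoff is lowered from 5 to 1.
          TrainingParameters params = new TrainingParameters();
          params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
          params.put(TrainingParameters.ITERATIONS_PARAM, "100");
          params.put(TrainingParameters.CUTOFF_PARAM, "1");

          // "person.train" is a placeholder for the annotated training file.
          ObjectStream<String> lines =
              new PlainTextByLineStream(new FileInputStream("person.train"), "UTF-8");
          ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

          TokenNameFinderModel model = NameFinderME.train(
              "en", "person", samples, params,
              (AdaptiveFeatureGenerator) null,           // default feature generator
              Collections.<String, Object>emptyMap());   // no extra resources
          samples.close();

          // The model can then be serialized with model.serialize(...).
      }
  }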

Jörn

On 12/7/11 5:46 PM, Vyacheslav Zholudev wrote:
> Hi Em,
>
> thanks for sharing your experience.
> I can't give you any suggestions since I switched to another tool for
> named-entity recognition, CRF++. It implements conditional random fields,
> which makes it possible to tag multiple types of entities at once, e.g.
> location and organization. The tool is fairly low-level: you first have to
> convert your training data and the sentences to be tagged into a specific
> format, but overall it works really well for me, even on a small number of
> annotated examples (200 sentences in my particular use case).
>
> Vyacheslav


Re: How does good training data look like?

Posted by Vyacheslav Zholudev <vy...@gmail.com>.
Hi Em,

thanks for sharing your experience. 
I can't give you any suggestions since I switched to another tool for
named-entity recognition, CRF++. It implements conditional random fields,
which makes it possible to tag multiple types of entities at once, e.g.
location and organization. The tool is fairly low-level: you first have to
convert your training data and the sentences to be tagged into a specific
format, but overall it works really well for me, even on a small number of
annotated examples (200 sentences in my particular use case).
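
(To give a rough idea of that format, with made-up columns and labels:
each training line is one token, optional feature columns such as a POS
tag, and the label in the last column; sentences are separated by blank
lines. A small template file defines which columns and offsets become
features, and training and tagging are done with the crf_learn and
crf_test command-line tools.)

  train.data:

      John      NNP  B-PER
      Smith     NNP  I-PER
      visited   VBD  O
      Berlin    NNP  B-LOC
      .         .    O

  template:

      U00:%x[-1,0]
      U01:%x[0,0]
      U02:%x[1,0]
      U03:%x[0,1]
      B

  Training and tagging:

      crf_learn template train.data model
      crf_test -m model test.data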

Vyacheslav
