Posted to dev@opennlp.apache.org by Carlos Scheidecker <na...@gmail.com> on 2014/05/20 12:08:21 UTC

How to use DefaultModelBuilderUtil

Hello all,

I am putting this question on its own thread so that it does not get lost.

The question is about the proper usage of DefaultModelBuilderUtil.

I have not figured out the proper format of the files. Here's what I think
from what I have been reading. Tell me if I am right.

From class DefaultModelBuilderUtil method generateModel

@param sentences   a file that contains one sentence per line.
    *              There should be at least 15K sentences
    *              consisting of a representative sample from
    *              user data

This seems to be a text file where each sentence is on one line.
I wonder if it has to be annotated, for instance:

<START:person> Archimedes <END> used the method of exhaustion to
approximate the value of π. Archimedes ( 287–212 BC ) was the first
to estimate π rigorously .

Or just:

Archimedes used the method of exhaustion to approximate the value of
π. Archimedes ( 287–212 BC ) was the first to estimate π rigorously .


@param knownEntities   a file consisting of a simple list of
   *                   unambiguous entities, one entry per line.
   *                   For instance, if one was trying to build a
   *                   person NER model then this file would be a
   *                   list of person names that are unambiguous
   *                   and are known to exist in the sentences

This would be a text file list?

Something like one name per line?

Archimedes
Socrates
....


* @param knownEntitiesBlacklist   This file contains a list of known bad hits
   *                              that the NER phase of this processing might
   *                              catch early one before the model iterates
   *                              to maturity

Same as the knownEntities but a list of what NOT to mark as an entity?
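
If that reading is right, the three inputs could be prepared along these lines. This is only a minimal sketch: the class name, the file names, the sample sentences and the "BC" blacklist entry are made up for illustration, and nothing here calls DefaultModelBuilderUtil itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ModelBuilderInputs {

    /** Writes one entry per line as UTF-8 and returns the path that was written. */
    static Path writeLines(Path dir, String fileName, List<String> lines) throws IOException {
        return Files.write(dir.resolve(fileName), lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("modelbuilder-inputs");

        // sentences: plain, unannotated text, one sentence per line.
        // Two lines only for illustration; the Javadoc asks for at least 15K.
        Path sentences = writeLines(dir, "sentences.txt", Arrays.asList(
            "Archimedes used the method of exhaustion to approximate the value of pi .",
            "Archimedes ( 287-212 BC ) was the first to estimate pi rigorously ."));

        // knownEntities: unambiguous names known to occur in the sentences.
        Path known = writeLines(dir, "knownentities.txt",
            Arrays.asList("Archimedes", "Socrates"));

        // knownEntitiesBlacklist: strings that should never be marked as entities.
        // "BC" is a hypothetical example of a bad hit.
        Path blacklist = writeLines(dir, "blacklist.txt", Arrays.asList("BC"));

        System.out.println(sentences);
        System.out.println(known);
        System.out.println(blacklist);
    }
}
```

The resulting paths would then be handed to generateModel as the sentences, knownEntities and knownEntitiesBlacklist arguments.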


The rest seemed quite straightforward.

Thanks,

Re: How to use DefaultModelBuilderUtil

Posted by Carlos Scheidecker <na...@gmail.com>.
Mark,

Great. I have lots of sentence templates, so I will use those. I will report
back to you when I'm done, but I won't be able to tackle this just now.

Thanks again.

cheers,


On Tue, May 20, 2014 at 5:31 AM, Mark G <gi...@gmail.com> wrote:

> That is correct: the sentence file does not need annotations, and the other
> files are one name per line.
> It uses the names file to annotate the sentences, and won't annotate
> anything that's in the blacklist file.
>
>
>
> Let me know how it goes!

Re: How to use DefaultModelBuilderUtil

Posted by Mark G <gi...@gmail.com>.
That is correct: the sentence file does not need annotations, and the other files are one name per line.
It uses the names file to annotate the sentences, and won't annotate anything that's in the blacklist file.
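
To make the idea concrete, here is a rough sketch of what annotating from a names list while honoring a blacklist can look like. It is not the code DefaultModelBuilderUtil runs: the class and method names are hypothetical, it only handles single-token names, and it assumes the sentences are already whitespace-tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NameListAnnotator {

    /**
     * Wraps every whitespace-separated token that is in knownEntities, and not
     * in blacklist, in OpenNLP-style <START:type> ... <END> tags.
     * Single-token names only, to keep the sketch short.
     */
    static String annotate(String sentence, Set<String> knownEntities,
                           Set<String> blacklist, String type) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (knownEntities.contains(token) && !blacklist.contains(token)) {
                out.add("<START:" + type + "> " + token + " <END>");
            } else {
                out.add(token); // unknown or blacklisted tokens pass through unchanged
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("Archimedes", "Socrates"));
        Set<String> blacklist = new HashSet<>(Arrays.asList("BC")); // hypothetical bad hit
        // prints: <START:person> Archimedes <END> used the method of exhaustion .
        System.out.println(annotate(
            "Archimedes used the method of exhaustion .", known, blacklist, "person"));
    }
}
```

A token that appears in both lists is left untagged, which is the behavior described above for the blacklist file.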



Let me know how it goes!
