Posted to users@opennlp.apache.org by Michael Schmitz <sc...@cs.washington.edu> on 2013/10/01 17:01:13 UTC

Next Steps for OpenNLP

Hi, I've used OpenNLP for a few years--in particular the chunker, POS
tagger, and tokenizer.  We're grateful for a high-performance library
with an Apache license, but one of our greatest complaints is the
quality of the models.  Yes--we're aware we can train our own--but
most people are looking for something that is good enough out of the
box (we aim for this with our products).  I'm not surprised that
volunteer engineers don't want to spend their time annotating data ;-)

I'm curious what other people see as the biggest shortcomings for
OpenNLP or the most important next steps for OpenNLP.  I may have an
opportunity to contribute to the project and I'm trying to figure out
where the community thinks the biggest impact could be made.

Peace.
Michael Schmitz

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/02/2013 03:59 AM, John Stewart wrote:
> Do we know if there's live interest in using the coref module -- which
> seems like abandonware?  (I've asked this before but I still don't have a
> sense of the level of interest).

The coref component has a few issues which need to be resolved before
we can ship it again as part of the opennlp-tools project.

If you are interested in getting some life into it again, please start
contributing to it; the code is in the sandbox.
The biggest problem it has is that we can't train it anymore, because
that part of the coref component was never entirely open sourced. I
worked on it for a while, but the models it produces don't come close
to how it performs with the models on SourceForge.

Jörn

Re: Next Steps for OpenNLP

Posted by Michael Schmitz <sc...@cs.washington.edu>.
In an effective coreference module?  There's extreme interest from my
research group.  But it's a hard problem and we're dissatisfied with
many of the existing systems.

Peace.  Michael

On Tue, Oct 1, 2013 at 6:59 PM, John Stewart <ca...@gmail.com> wrote:
> Do we know if there's live interest in using the coref module -- which
> seems like abandonware?  (I've asked this before but I still don't have a
> sense of the level of interest).
>
> jds
>
>
> On Tue, Oct 1, 2013 at 8:06 PM, Mark G <gi...@gmail.com> wrote:
>
>> I've been using OpenNLP for a few years and I find the best results occur
>> when the models are generated using samples of the data they will be run
>> against, one of the reasons I like the Maxent approach. I am not sure
>> attempting to provide models will bear much fruit other than users will no
>> longer be afraid of the licensing issues associated with using them in
>> commercial systems. I do strongly think we should provide a modelbuilding
>> framework (that calls the training api) and a default impl.
>> Coincidentally....I have been building a framework and impl over the last
>> few months that creates models based on seeding an iterative process with
>> known entities and iterating through a set of supplied sentences to
>> recursively create annotations, write them, create a maxentmodel, load the
>> model, create more annotations based on the results (there is a validation
>> object involved), and so on.... With this method I was able to create an
>> NER model for people's names against a 200K sentence corpus that returns
>> acceptable results just by starting with a list of five highly unambiguous
>> names. I will propose the framework in more detail in the coming days and
>> supply my impl if everyone is interested.
>> As for the initial question, I would like to see OpenNLP provide a
>> framework for rapidly/semi-automatically building models out of user data,
>> and also performing entity resolution across documents, in order to assign
>> a probability to whether the "Bob" in one document is the same as "Bob" in
>> another.
>> MG
>>
>>
>> On Tue, Oct 1, 2013 at 11:01 AM, Michael Schmitz
>> <sc...@cs.washington.edu>wrote:
>>
>> > Hi, I've used OpenNLP for a few years--in particular the chunker, POS
>> > tagger, and tokenizer.  We're grateful for a high performance library
>> > with an Apache license, but one of our greatest complaints is the
>> > quality of the models.  Yes--we're aware we can train our own--but
>> > most people are looking for something that is good enough out of the
>> > box (we aim for this with our products).  I'm not surprised that
>> > volunteer engineers don't want to spend their time annotating data ;-)
>> >
>> > I'm curious what other people see as the biggest shortcomings for
>> > OpenNLP or the most important next steps for OpenNLP.  I may have an
>> > opportunity to contribute to the project and I'm trying to figure out
>> > where the community thinks the biggest impact could be made.
>> >
>> > Peace.
>> > Michael Schmitz
>> >
>>

Re: Next Steps for OpenNLP

Posted by John Stewart <ca...@gmail.com>.
Do we know if there's live interest in using the coref module -- which
seems like abandonware?  (I've asked this before but I still don't have a
sense of the level of interest).

jds


On Tue, Oct 1, 2013 at 8:06 PM, Mark G <gi...@gmail.com> wrote:

> I've been using OpenNLP for a few years and I find the best results occur
> when the models are generated using samples of the data they will be run
> against, one of the reasons I like the Maxent approach. I am not sure
> attempting to provide models will bear much fruit other than users will no
> longer be afraid of the licensing issues associated with using them in
> commercial systems. I do strongly think we should provide a modelbuilding
> framework (that calls the training api) and a default impl.
> Coincidentally....I have been building a framework and impl over the last
> few months that creates models based on seeding an iterative process with
> known entities and iterating through a set of supplied sentences to
> recursively create annotations, write them, create a maxentmodel, load the
> model, create more annotations based on the results (there is a validation
> object involved), and so on.... With this method I was able to create an
> NER model for people's names against a 200K sentence corpus that returns
> acceptable results just by starting with a list of five highly unambiguous
> names. I will propose the framework in more detail in the coming days and
> supply my impl if everyone is interested.
> As for the initial question, I would like to see OpenNLP provide a
> framework for rapidly/semi-automatically building models out of user data,
> and also performing entity resolution across documents, in order to assign
> a probability to whether the "Bob" in one document is the same as "Bob" in
> another.
> MG
>
>
> On Tue, Oct 1, 2013 at 11:01 AM, Michael Schmitz
> <sc...@cs.washington.edu>wrote:
>
> > Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> > tagger, and tokenizer.  We're grateful for a high performance library
> > with an Apache license, but one of our greatest complaints is the
> > quality of the models.  Yes--we're aware we can train our own--but
> > most people are looking for something that is good enough out of the
> > box (we aim for this with our products).  I'm not surprised that
> > volunteer engineers don't want to spend their time annotating data ;-)
> >
> > I'm curious what other people see as the biggest shortcomings for
> > OpenNLP or the most important next steps for OpenNLP.  I may have an
> > opportunity to contribute to the project and I'm trying to figure out
> > where the community thinks the biggest impact could be made.
> >
> > Peace.
> > Michael Schmitz
> >
>

Re: Next Steps for OpenNLP

Posted by Ant B <am...@gmail.com>.
Hi community,

Just to flag active interest in the coreference module.

It plays an important role in my team's pipeline - we are interested in relation extraction.  The module, in my view, is a strong advantage of the excellent OpenNLP project.  I agree that it feels a little neglected compared to the rest of the project, likely due to complexity. To discard it as abandonware would be a sad loss.  I've commented on this list previously about my experience getting the module working (I don't claim to be an expert!), so it seems there is other active interest too.

A cross-document entity disambiguation tool would indeed be an awesome addition!

James - thanks for your guidance on efforts to navigate copyright issues and build up-to-date models!

Thanks,

Ant


On Oct 1, 2013, at 9:00 PM, James Kosin <ja...@gmail.com> wrote:

> Mark & Michael & Others,
> 
> The current models were trained using old annotated news articles and are really just useful examples.  They were never meant to be complete or otherwise final.  The copyright issues are complicated, but in a nutshell the owners of the corpora that were used allow us to use the generated data for educational and research purposes only in most cases.  This means that commercial use is strictly forbidden by the copyright holders, never mind the fact that you can't regenerate the original material from the models.  I know it sounds like an odd copyright, and some models may be a bit more lenient on the details of the copyright.
> 
> The corpora were generated by people doing research and other tasks via the CONLL and other projects to train models to detect POS, NER, and other types of pre-processing of textual data over the years.  Most of these have continuing yearly or biyearly projects to do additional work in these areas.  OpenNLP isn't directly involved in these (to my knowledge... I'm sure to get some bad press on this).  But the goals of these projects are to get a set of training and test data to experiment and research on different model approaches, to see if a best model for the type of parsing/processing/understanding, etc. of the textual data can be found for the situation.
> 
> With an Apache license, we have to be able to distribute the sources for the models to align with the license... as such, we have other side projects set up to research and develop an easier method to generate and tag the data for the various types of corpus data we need to train against.  But the catch is that the data we gather needs to be FREE of any copyright restrictions... we have found several avenues that seem promising in this area.
> https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations
> 
> We have sources for this and other works in progress in the OpenNLP project's sandbox.
>    http://svn.apache.org/viewvc/opennlp/sandbox/    [via ViewVC]
>    https://svn.apache.org/repos/asf/opennlp/sandbox/    [via subversion]
> 
> By all means please get involved!
> We need people who can read and annotate various languages.  We need people who can test models.  We need people who can come up with new ideas.  We have other projects in the wiki for adding support for model types other than just maxent.  There is also another for using SORA as the language.
> 
> Thanks for listening to me,
> James Kosin
> 


Re: Next Steps for OpenNLP

Posted by William Colen <wi...@gmail.com>.
If you like, you can take a look at chapters 6.6 and 6.8 of
http://www.teses.usp.br/teses/disponiveis/45/45134/tde-02052013-135414/publico/WilliamColen_Dissertation.pdf

There I wrote about my experience tuning Portuguese models for the POS
Tagger and Chunker.
I tried out many OpenNLP configurations and measured their impact using
both the performance monitor and my final application itself.


2013/10/7 Jörn Kottmann <ko...@gmail.com>

> On 10/07/2013 11:00 PM, Michael Schmitz wrote:
>
>> Do you know how many sentences/tokens were annotated for the OpenNLP
>> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
>> number of annotations vs performance?
>>
>
> If the model gets bigger the computations get more complex, but as far as I
> know the effect of the model no longer fitting in the CPU cache is much more
> significant than that. I am using hash-based int features to reduce the
> memory footprint in the name finder.
>
> I don't have much experience with the Chunker or POS Tagger with regard to
> performance, but it should be easy to do a series of tests; the command line
> tools have built-in performance monitoring.
>
> Jörn
>

Re: Next Steps for OpenNLP

Posted by William Colen <wi...@gmail.com>.
Yes. Both POS Tagger and Chunker were trained with Bosque, which includes
4,212 sentences.

http://www.linguateca.pt/floresta/info_floresta_English.html


Regarding memory usage -
For the POS Tagger, the default dictionary is loaded into a hash table. For
my application I implemented a way to store the lexeme dictionary somewhere
else, like a database. I contributed to OpenNLP the code that allows
extending the default dictionary, but I have not been able to contribute a
different implementation yet.
Also, I did a lot of experiments comparing model effectiveness and model
size; most of them were not included in the text I pointed to, but a few
chapters discuss that a little, like the Sentence Detector one (6.2).
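
To give an idea of what the dictionary-behind-a-database idea above could
look like, here is a minimal sketch (not the code I contributed; OpenNLP's
TagDictionary interface is real, but the table and column names are made
up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import opennlp.tools.postag.TagDictionary;

public class JdbcTagDictionary implements TagDictionary {

  private final Connection conn;

  public JdbcTagDictionary(Connection conn) {
    this.conn = conn;
  }

  @Override
  public String[] getTags(String word) {
    // Look the lexeme up in a database table instead of an in-memory map.
    try (PreparedStatement ps = conn.prepareStatement(
        "SELECT tag FROM lexeme_tags WHERE lexeme = ?")) {
      ps.setString(1, word);
      try (ResultSet rs = ps.executeQuery()) {
        List<String> tags = new ArrayList<String>();
        while (rs.next()) {
          tags.add(rs.getString(1));
        }
        // Return null for unknown words, matching the dictionary contract.
        return tags.isEmpty() ? null : tags.toArray(new String[tags.size()]);
      }
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }
}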



2013/10/10 Michael Schmitz <sc...@cs.washington.edu>

> Hi William--thanks for the pointer.  Do you know the size of your
> training sets?  I did not see that in the chapters you pointed me to.
>
> On Mon, Oct 7, 2013 at 3:48 PM, William Colen <wi...@gmail.com>
> wrote:
> > Actually, I measured the model effectiveness, not the memory vs.
> > performance trade-off.
> >
> >
> >
> >
> > 2013/10/7 Michael Schmitz <sc...@cs.washington.edu>
> >
> >> Hi Jorn, let me be more precise.  Do you have a notion of how the
> >> precision-recall curve (AUC) changes as a function of the number of
> >> annotations?  I'm curious how many annotations are needed for a model
> >> with reasonable precision-recall AUC and reasonable performance
> >> (memory and speed).
> >>
> >> Peace.  Michael
> >>
> >> On Mon, Oct 7, 2013 at 3:29 PM, Jörn Kottmann <ko...@gmail.com>
> wrote:
> >> > On 10/07/2013 11:00 PM, Michael Schmitz wrote:
> >> >>
> >> >> Do you know how many sentences/tokens were annotated for the OpenNLP
> >> >> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
> >> >> number of annotations vs performance?
> >> >
> >> >
> >> > If the model gets bigger the computations get more complex, but as far
> >> > as I know the effect of the model no longer fitting in the CPU cache is
> >> > much more significant than that. I am using hash-based int features to
> >> > reduce the memory footprint in the name finder.
> >> >
> >> > I don't have much experience with the Chunker or POS Tagger with regard
> >> > to performance, but it should be easy to do a series of tests; the
> >> > command line tools have built-in performance monitoring.
> >> >
> >> > Jörn
> >>
>

Re: Next Steps for OpenNLP

Posted by Michael Schmitz <sc...@cs.washington.edu>.
Hi William--thanks for the pointer.  Do you know the size of your
training sets?  I did not see that in the chapters you pointed me to.

On Mon, Oct 7, 2013 at 3:48 PM, William Colen <wi...@gmail.com> wrote:
> Actually, I measured the model effectiveness, not the memory vs. performance trade-off.
>
>
>
>
> 2013/10/7 Michael Schmitz <sc...@cs.washington.edu>
>
>> Hi Jorn, let me be more precise.  Do you have a notion of how the
>> precision-recall curve (AUC) changes as a function of the number of
>> annotations?  I'm curious how many annotations are needed for a model
>> with reasonable precision-recall AUC and reasonable performance
>> (memory and speed).
>>
>> Peace.  Michael
>>
>> On Mon, Oct 7, 2013 at 3:29 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>> > On 10/07/2013 11:00 PM, Michael Schmitz wrote:
>> >>
>> >> Do you know how many sentences/tokens were annotated for the OpenNLP
>> >> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
>> >> number of annotations vs performance?
>> >
>> >
>> > If the model gets bigger the computations get more complex, but as far
>> > as I know the effect of the model no longer fitting in the CPU cache is
>> > much more significant than that. I am using hash-based int features to
>> > reduce the memory footprint in the name finder.
>> >
>> > I don't have much experience with the Chunker or POS Tagger with regard
>> > to performance, but it should be easy to do a series of tests; the
>> > command line tools have built-in performance monitoring.
>> >
>> > Jörn
>>

Re: Next Steps for OpenNLP

Posted by William Colen <wi...@gmail.com>.
Actually, I measured the model effectiveness, not the memory vs. performance trade-off.




2013/10/7 Michael Schmitz <sc...@cs.washington.edu>

> Hi Jorn, let me be more precise.  Do you have a notion of how the
> precision-recall curve (AUC) changes as a function of the number of
> annotations?  I'm curious how many annotations are needed for a model
> with reasonable precision-recall AUC and reasonable performance
> (memory and speed).
>
> Peace.  Michael
>
> On Mon, Oct 7, 2013 at 3:29 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> > On 10/07/2013 11:00 PM, Michael Schmitz wrote:
> >>
> >> Do you know how many sentences/tokens were annotated for the OpenNLP
> >> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
> >> number of annotations vs performance?
> >
> >
> > If the model gets bigger the computations get more complex, but as far
> > as I know the effect of the model no longer fitting in the CPU cache is
> > much more significant than that. I am using hash-based int features to
> > reduce the memory footprint in the name finder.
> >
> > I don't have much experience with the Chunker or POS Tagger with regard
> > to performance, but it should be easy to do a series of tests; the
> > command line tools have built-in performance monitoring.
> >
> > Jörn
>

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/08/2013 12:45 AM, Michael Schmitz wrote:
> Hi Jorn, let me be more precise.  Do you have a notion of how the
> precision-recall curve (AUC) changes as a function of the number of
> annotations?  I'm curious how many annotations are needed for a model
> with reasonable precision-recall AUC and reasonable performance
> (memory and speed).

No, I don't; you would need to write a class which trains and tests many
times with different amounts of training data.

Maybe we should make this use case really easy and add some kind of
experimenter support to our components. The experimenter could take a
class which provides configuration for the trainer depending on the
iteration, and the results of each iteration could be recorded in a CSV
or text file which can later be analyzed with tools like Excel, R, etc.

I often run the name finder with slightly modified feature generation to
find a setup which works with my data; this could probably be automated
quite a bit.
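
Something like this rough sketch of a learning-curve run is what I have in
mind (untested, written against the 1.5.x API; the file names are
placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.CollectionObjectStream;
import opennlp.tools.util.ObjectStream;

public class LearningCurveExperimenter {

  public static void main(String[] args) throws Exception {
    List<String> train = readLines("en-ner-person.train");
    List<String> heldOut = readLines("en-ner-person.test");

    PrintWriter csv = new PrintWriter("learning-curve.csv");
    csv.println("sentences,precision,recall,f1");

    // Train on growing slices of the data, evaluate each model on the same
    // held-out set, and record one CSV row per slice.
    for (int tenth = 1; tenth <= 10; tenth++) {
      int n = train.size() * tenth / 10;
      TokenNameFinderModel model = NameFinderME.train(
          "en", "person", samples(train.subList(0, n)), null);
      TokenNameFinderEvaluator evaluator =
          new TokenNameFinderEvaluator(new NameFinderME(model));
      evaluator.evaluate(samples(heldOut));
      csv.printf("%d,%.4f,%.4f,%.4f%n", n,
          evaluator.getFMeasure().getPrecisionScore(),
          evaluator.getFMeasure().getRecallScore(),
          evaluator.getFMeasure().getFMeasure());
    }
    csv.close();
  }

  // Wrap annotated sentences (<START:person> ... <END> format) as a stream.
  static ObjectStream<NameSample> samples(List<String> lines) {
    return new NameSampleDataStream(new CollectionObjectStream<String>(lines));
  }

  static List<String> readLines(String path) throws Exception {
    List<String> lines = new ArrayList<String>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      lines.add(line);
    }
    in.close();
    return lines;
  }
}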

Jörn

Re: Next Steps for OpenNLP

Posted by Michael Schmitz <sc...@cs.washington.edu>.
Hi Jorn, let me be more precise.  Do you have a notion of how the
precision-recall curve (AUC) changes as a function of the number of
annotations?  I'm curious how many annotations are needed for a model
with reasonable precision-recall AUC and reasonable performance
(memory and speed).

Peace.  Michael

On Mon, Oct 7, 2013 at 3:29 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 10/07/2013 11:00 PM, Michael Schmitz wrote:
>>
>> Do you know how many sentences/tokens were annotated for the OpenNLP
>> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
>> number of annotations vs performance?
>
>
> If the model gets bigger the computations get more complex, but as far as I
> know the effect of the model no longer fitting in the CPU cache is much more
> significant than that. I am using hash-based int features to reduce the
> memory footprint in the name finder.
>
> I don't have much experience with the Chunker or POS Tagger with regard to
> performance, but it should be easy to do a series of tests; the command line
> tools have built-in performance monitoring.
>
> Jörn

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/07/2013 11:00 PM, Michael Schmitz wrote:
> Do you know how many sentences/tokens were annotated for the OpenNLP
> POS and CHUNK models?  Do you have an idea of the "sweet spot" for
> number of annotations vs performance?

If the model gets bigger the computations get more complex, but as far
as I know the effect of the model no longer fitting in the CPU cache is
much more significant than that. I am using hash-based int features to
reduce the memory footprint in the name finder.

I don't have much experience with the Chunker or POS Tagger with regard
to performance, but it should be easy to do a series of tests; the
command line tools have built-in performance monitoring.
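
The hash trick mentioned above is easy to illustrate; a toy sketch (not the
actual name finder code) of mapping string features to fixed-range ints:

public final class HashedFeature {

  // Number of hash buckets; a real system would tune this (an assumption).
  private static final int BUCKETS = 1 << 20;

  // Map a string feature to a small int, accepting rare collisions in
  // exchange for never storing the feature string itself.
  public static int toFeature(String feature) {
    return (feature.hashCode() & 0x7fffffff) % BUCKETS;
  }

  public static void main(String[] args) {
    System.out.println(toFeature("prev=the"));
    System.out.println(toFeature("w=Obama"));
  }
}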

Jörn

Re: Next Steps for OpenNLP

Posted by Michael Schmitz <sc...@cs.washington.edu>.
> The current models were trained using old annotated news articles and are really just useful examples.  They were never meant to be complete or otherwise final.  The copyright issues are complicated, but in a nutshell the owners of the corpora that were used allow us to use the generated data for educational and research purposes only in most cases.  This means that commercial use is strictly forbidden by the copyright holders, never mind the fact that you can't regenerate the original material from the models.  I know it sounds like an odd copyright, and some models may be a bit more lenient on the details of the copyright.

Do you know how many sentences/tokens were annotated for the OpenNLP
POS and CHUNK models?  Do you have an idea of the "sweet spot" for
number of annotations vs performance?

Peace.  Michael

On Tue, Oct 1, 2013 at 8:00 PM, James Kosin <ja...@gmail.com> wrote:
> Mark & Michael & Others,
>
> The current models were trained using old annotated news articles and are
> really just useful examples.  They were never meant to be complete or
> otherwise final.  The copyright issues are complicated, but in a
> nutshell the owners of the corpora that were used allow us to use the
> generated data for educational and research purposes only in most cases.
> This means that commercial use is strictly forbidden by the copyright
> holders, never mind the fact that you can't regenerate the original
> material from the models.  I know it sounds like an odd copyright, and some
> models may be a bit more lenient on the details of the copyright.
>
> The corpora were generated by people doing research and other tasks via
> the CONLL and other projects to train models to detect POS, NER, and other
> types of pre-processing of textual data over the years.  Most of these have
> continuing yearly or biyearly projects to do additional work in these areas.
> OpenNLP isn't directly involved in these (to my knowledge... I'm sure to get
> some bad press on this).  But the goals of these projects are to get a set
> of training and test data to experiment and research on different model
> approaches, to see if a best model for the type of
> parsing/processing/understanding, etc. of the textual data can be found for
> the situation.
>
> With an Apache license, we have to be able to distribute the sources for the
> models to align with the license... as such, we have other side
> projects set up to research and develop an easier method to generate and tag
> the data for the various types of corpus data we need to train against.
> But the catch is that the data we gather needs to be FREE of any copyright
> restrictions... we have found several avenues that seem promising in this
> area.
> https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations
>
> We have sources for this and other works in progress in the OpenNLP
> project's sandbox.
>     http://svn.apache.org/viewvc/opennlp/sandbox/    [via ViewVC]
>     https://svn.apache.org/repos/asf/opennlp/sandbox/    [via subversion]
>
> By all means please get involved!
> We need people who can read and annotate various languages.  We need people
> who can test models.  We need people who can come up with new ideas.  We
> have other projects in the wiki for adding support for model types other
> than just maxent.  There is also another for using SORA as the language.
>
> Thanks for listening to me,
> James Kosin
>

Re: Next Steps for OpenNLP

Posted by James Kosin <ja...@gmail.com>.
Mark & Michael & Others,

The current models were trained using old annotated news articles and
are really just useful examples.  They were never meant to be complete
or otherwise final.  The copyright issues are complicated, but in a
nutshell the owners of the corpora that were used allow us to use the
generated data for educational and research purposes only in most
cases.  This means that commercial use is strictly forbidden by the
copyright holders, never mind the fact that you can't regenerate the
original material from the models.  I know it sounds like an odd
copyright, and some models may be a bit more lenient on the details of
the copyright.

The corpora were generated by people doing research and other tasks
via the CONLL and other projects to train models to detect POS, NER, and
other types of pre-processing of textual data over the years.  Most of
these have continuing yearly or biyearly projects to do additional work
in these areas.  OpenNLP isn't directly involved in these (to my
knowledge... I'm sure to get some bad press on this).  But the goals of
these projects are to get a set of training and test data to experiment
and research on different model approaches, to see if a best model for
the type of parsing/processing/understanding, etc. of the textual data
can be found for the situation.

With an Apache license, we have to be able to distribute the sources for
the models to align with the license... as such, we have other side
projects set up to research and develop an easier method to generate and
tag the data for the various types of corpus data we need to train
against.  But the catch is that the data we gather needs to be FREE of
any copyright restrictions... we have found several avenues that seem
promising in this area.
https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations

We have sources for this and other works in progress in the OpenNLP
project's sandbox.
     http://svn.apache.org/viewvc/opennlp/sandbox/    [via ViewVC]
     https://svn.apache.org/repos/asf/opennlp/sandbox/    [via subversion]

By all means please get involved!
We need people who can read and annotate various languages.  We need
people who can test models.  We need people who can come up with new
ideas.  We have other projects in the wiki for adding support for model
types other than just maxent.  There is also another for using SORA as
the language.

Thanks for listening to me,
James Kosin


Re: Next Steps for OpenNLP

Posted by Mark G <gi...@gmail.com>.
I think it has potential. Here is a better description of what it does.
The user story would be something like this: "As a user, I need to be able
to rapidly create and refine models for extracting Named Entities from my
particular data so that I can constantly improve the results of my
pipeline."
The processing flow of the tool is this:
The user supplies a set of sentences by implementing a SentenceProvider
interface.
The user supplies a validation layer by implementing an EntityValidator
interface.
The user supplies a location to write the annotated sentences via an
AnnotatedSentenceWriter interface.
The user supplies a list of seed entities via a KnownEntityProvider
interface.
The user passes these interfaces, along with a number of iterations, into
the SemiSupervisedModelBuilder interface impl.
I wrote a prototype implementation of each (it's rough at this point);
sorry for the extremely long post.


here are the interfaces:

public interface SemiSupervisedModelBuilder {
  void build(SentenceProvider sentenceProvider,
      KnownEntityProvider knownQuantityProvider,
      EntityValidator badEntityProvider,
      AnnotatedSentenceWriter annSentenceWriter,
      Integer iterations);
}

public interface SentenceProvider {
  Set<String> getSentences();
}

public interface KnownEntityProvider {
  Set<String> getKnownEntities();
  void addKnownEntity(String unambiguousEntity);
  String getKnownEntitiesType();
}

public interface EntityValidator {
  Set<String> getBlacklist();
  Boolean isValidEntity(String token);
  Boolean isValidEntity(String token, double prob);
  Boolean isValidEntity(String token, Span namedEntity, String[] words,
      String[] posWhiteList, String[] pos);
}

public interface AnnotatedSentenceWriter {
  void write(List<String> annotatedSentences);
  void setFilePath(String path);
  String getFilePath();
}
/////////////here is the impl that controls the flow


import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;

public class SemiSupervisedModelBuilderImpl implements SemiSupervisedModelBuilder {

  public static void main(String[] args) {
    SemiSupervisedModelBuilder builder = new SemiSupervisedModelBuilderImpl();
    SentenceProvider sp = new MySQLSentenceProviderImpl();
    EntityValidator kbe = new GenericEntityValidatorImpl();
    KnownEntityProvider kqp = new GenericKnownEntityProvider();
    AnnotatedSentenceWriter asw = new GenericAnnotatedSentenceWriter();

    builder.build(sp, kqp, kbe, asw, 2);
  }

  TokenizerModel tm;
  TokenizerME wordBreaker;
  TokenNameFinderModel nerModel;
  NameFinderME nameFinder;

  @Override
  public void build(SentenceProvider sentenceProvider,
      KnownEntityProvider knownQuantityProvider,
      EntityValidator knownBadEntityProvider,
      AnnotatedSentenceWriter annSentenceWriter, Integer enrichmentIterations) {

    Set<String> sentences = sentenceProvider.getSentences();
    List<String> annotatedSentences = new ArrayList<>();
    try {
      for (int iters = 0; iters < enrichmentIterations; iters++) {
        int counter1 = 0;
        System.out.println("-----------------iteration : " + iters);
        // Pass 1: annotate every sentence that contains a known entity.
        for (String sentence : sentences) {
          counter1++;
          if (counter1 % 1000 == 0) {
            System.out.println("sentence " + counter1 + " of iter " + iters);
          }
          for (String known : knownQuantityProvider.getKnownEntities()) {
            if (sentence.contains(known)) {
              String annSent = sentence.replace(known, " <START:"
                  + knownQuantityProvider.getKnownEntitiesType() + "> "
                  + known.trim() + " <END> ");
              if (!annotatedSentences.contains(annSent)) {
                annotatedSentences.add(annSent);
              }
            }
          }
        }
        System.out.println("writing " + annotatedSentences.size() + " annotations");
        annSentenceWriter.write(annotatedSentences);
        // Train an intermediate model on the annotations gathered so far.
        buildmodel(annSentenceWriter.getFilePath());
        String modelPath = "c:\\temp\\opennlpmodels\\";
        InputStream stream = new FileInputStream(new File(modelPath + "en-token.zip"));
        tm = new TokenizerModel(stream);
        wordBreaker = new TokenizerME(tm);
        // Load the model we just made.
        nerModel = new TokenNameFinderModel(new FileInputStream(
            new File(modelPath + "en-ner-person.train.model")));
        nameFinder = new NameFinderME(nerModel);
        int counter = 0;
        // Pass 2: run the intermediate model over the sentences; every hit
        // that passes validation becomes a known entity for the next pass.
        for (String sentence : sentences) {
          counter++;
          if (counter % 1000 == 0) {
            System.out.println("sentence " + counter + " of iter " + iters);
            nameFinder.clearAdaptiveData();
          }
          String[] stringTokens = wordBreaker.tokenize(sentence);
          Span[] spans = nameFinder.find(stringTokens);
          double[] probs = nameFinder.probs(); // available for thresholding
          if (spans.length > 0) {
            for (String token : Span.spansToStrings(spans, stringTokens)) {
              // Only promote entities that pass validation.
              if (knownBadEntityProvider.isValidEntity(token)) {
                knownQuantityProvider.addKnownEntity(token);
                String annSent = sentence.replace(token, " <START:"
                    + knownQuantityProvider.getKnownEntitiesType() + "> "
                    + token.trim() + " <END> ");
                annotatedSentences.add(annSent);
                System.out.println("NER: " + token);
              }
            }
          }
        }
        // This set grows and can be validated via the blacklist and other means.
        for (String a : knownQuantityProvider.getKnownEntities()) {
          System.out.println("knowns: " + a);
        }
      }
      annSentenceWriter.write(annotatedSentences);
      System.err.println("BUILDING FINAL MODEL");
      buildmodel(annSentenceWriter.getFilePath());
      System.err.println("FINAL MODEL COMPLETE");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public void buildmodel(String path) throws Exception {
    System.out.println("reading training data...");
    Charset charset = Charset.forName("UTF-8");
    // Read the annotated sentences back from the path the writer used.
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream(path), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
    System.out.println("\tgenerating model...");
    TokenNameFinderModel model =
        NameFinderME.train("en", "person", sampleStream, null);
    sampleStream.close();
    OutputStream modelOut = new BufferedOutputStream(
        new FileOutputStream(new File(path + ".model")));
    model.serialize(modelOut);
    modelOut.close();
  }
}

Here is the KnownEntityProvider impl. This is essentially the starting
point for iteratively creating the model:

import java.util.HashSet;
import java.util.Set;

public class GenericKnownEntityProvider implements KnownEntityProvider {

  Set<String> ret = new HashSet<>();

  @Override
  public Set<String> getKnownEntities() {
    if (ret.isEmpty()) {
      // Seed list of highly unambiguous person names.
      ret.add("Barack Obama");
      ret.add("Mitt Romney");
      ret.add("John Doe");
      ret.add("Bill Gates");
      ret.add("Nguyen Tan Dung");
      ret.add("Hassanal Bolkiah");
      ret.add("Bashar al-Assad");
      ret.add("Faysal Khabbaz Hamou");
      ret.add("Dr Talwar");
    }
    return ret;
  }

  @Override
  public String getKnownEntitiesType() {
    return "person";
  }

  @Override
  public void addKnownEntity(String unambiguousEntity) {
    ret.add(unambiguousEntity);
  }
}

here is my simple entity validator... the badentities set is what users can
add to in order to iteratively improve the resulting model...
the user can validate any way they want; the overloads make that obvious, I
hope:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import opennlp.tools.util.Span;

public class GenericEntityValidatorImpl implements EntityValidator {

  private Set<String> badentities = new HashSet<>();
  private final double MIN_SCORE_FOR_TRAINING = 0.95d;

  @Override
  public Set<String> getBlacklist() {
    badentities.add(".");
    badentities.add("-");
    badentities.add(",");
    badentities.add(";");
    badentities.add("the");
    badentities.add("that");
    badentities.add("several");
    badentities.add("model");
    badentities.add("our");
    badentities.add("are");
    badentities.add("in");
    badentities.add("are");
    badentities.add("at");
    badentities.add("is");
    badentities.add("for");
    badentities.add("the");
    badentities.add("during");
    badentities.add("south");
    badentities.add("from");
    badentities.add("recounts");
    badentities.add("wissenschaftliches");
    badentities.add("if");
    badentities.add("security");
    badentities.add("denouncing");
    badentities.add("writes");
    badentities.add("but");
    badentities.add("operation");
    badentities.add("adds");
    badentities.add("Above");
    badentities.add("but");
    badentities.add("RIP");
    badentities.add("on");
    badentities.add("no");
    badentities.add("agrees");
    badentities.add("year");
    badentities.add("for");
    badentities.add("you");
    badentities.add("red");
    badentities.add("added");
    badentities.add("hello");
    badentities.add("around");
    badentities.add("has");
    badentities.add("turn");
    badentities.add("surrounding");
    badentities.add("\" No");
    badentities.add("aug.");
    badentities.add("or");
    badentities.add("quips");
    badentities.add("september");
    badentities.add("[mr");
    badentities.add("diseases");
    badentities.add("when");
    badentities.add("bbc");
    badentities.add(":\"");
    badentities.add("dr");
    badentities.add("baby");
    badentities.add("on");
    badentities.add("route");
    badentities.add("'");
    badentities.add("\"");
    badentities.add("a");
    badentities.add("her");
    badentities.add("'");
    badentities.add("\"");
    badentities.add("two");
    badentities.add("that");
    badentities.add(":");
    badentities.add("one");
    return badentities;
  }

  @Override
  public Boolean isValidEntity(String token) {
    if (badentities.isEmpty()) {
      getBlacklist();
    }
    // Reject single-word candidates and anything containing a blacklisted word.
    String[] tokens = token.toLowerCase().split(" ");
    if (tokens.length >= 2) {
      for (String t : tokens) {
        if (badentities.contains(t.trim())) {
          System.out.println("bad token : " + token);
          return false;
        }
      }
    } else {
      System.out.println("bad token : " + token);
      return false;
    }

    // Reject tokens with characters outside [a-z ], unless the hit is an
    // inner hyphen as in "al-Assad".
    Pattern p = Pattern.compile("[^a-z ]",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    if (p.matcher(token).find()) {
      System.out.println("hit on [^a-z ]  :  " + token);
      if (!token.toLowerCase().matches(".*[a-z]-[a-z].*")) {
        System.out.println("bad token : " + token);
        return false;
      } else {
        System.out.println("false pos : " + token);
      }
    }
    Boolean b = true;
    if (badentities.contains(token.toLowerCase())) {
      System.out.println("bad token : " + token);
      b = false;
    }
    return b;
  }

  @Override
  public Boolean isValidEntity(String token, double prob) {
    // Only accept entities the model is confident about.
    if (prob < MIN_SCORE_FOR_TRAINING) {
      return false;
    }
    return isValidEntity(token);
  }

  @Override
  public Boolean isValidEntity(String token, Span namedEntity, String[] words,
      String[] posWhiteList, String[] pos) {

    boolean b = isValidEntity(token);
    if (!b) {
      return b;
    }
    // Every token inside the span must carry a whitelisted POS tag.
    for (int i = namedEntity.getStart(); i < namedEntity.getEnd(); i++) {
      boolean whitelisted = false;
      for (String ps : posWhiteList) {
        if (ps.equals(pos[i])) {
          whitelisted = true;
          break;
        }
      }
      if (!whitelisted) {
        return false;
      }
    }
    return b;
  }
}

the AnnotatedSentenceWriter dictates where to write the output sentences.
This is great if someone is doing this in a distributed way (like in
Hadoop); this could write out to HBase or HDFS or something where it could
be crowdsourced or whatever...


import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class GenericAnnotatedSentenceWriter implements AnnotatedSentenceWriter {

  private String path = "c:\\temp\\opennlpmodels\\en-ner-person.train";

  @Override
  public void write(List<String> sentences) {
    try {
      // Overwrite the file with the current set of annotated sentences.
      FileWriter writer = new FileWriter(this.getFilePath(), false);
      for (String s : sentences) {
        writer.write(s.trim() + "\n");
      }
      writer.close();
    } catch (IOException ex) {
      // Report write failures instead of swallowing them.
      ex.printStackTrace();
    }
  }

  @Override
  public void setFilePath(String path) {
    this.path = path;
  }

  @Override
  public String getFilePath() {
    return path;
  }
}
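
For the distributed case mentioned above, an HDFS-backed variant could look
roughly like this (a sketch only, assuming the standard org.apache.hadoop.fs
API; the path is a placeholder):

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAnnotatedSentenceWriter implements AnnotatedSentenceWriter {

  private String path = "hdfs:///user/nlp/en-ner-person.train";

  @Override
  public void write(List<String> sentences) {
    try {
      FileSystem fs = FileSystem.get(new Configuration());
      // Overwrite the HDFS file with the current set of annotations.
      Writer writer = new OutputStreamWriter(fs.create(new Path(path), true));
      for (String s : sentences) {
        writer.write(s.trim() + "\n");
      }
      writer.close();
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }

  @Override
  public void setFilePath(String path) {
    this.path = path;
  }

  @Override
  public String getFilePath() {
    return path;
  }
}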

if you made it this far down the email, please let me know what you think. I
believe it has potential.

thanks
MG



On Thu, Oct 3, 2013 at 4:02 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 10/02/2013 02:06 AM, Mark G wrote:
>
>> I've been using OpenNLP for a few years and I find the best results occur
>> when the models are generated using samples of the data they will be run
>> against, one of the reasons I like the Maxent approach. I am not sure
>> attempting to provide models will bear much fruit other than users will no
>> longer be afraid of the licensing issues associated with using them in
>> commercial systems. I do strongly think we should provide a modelbuilding
>> framework (that calls the training api) and a default impl.
>> Coincidentally....I have been building a framework and impl over the last
>> few months that creates models based on seeding an iterative process with
>> known entities and iterating through a set of supplied sentences to
>> recursively create annotations, write them, create a maxentmodel, load the
>> model, create more annotations based on the results (there is a validation
>> object involved), and so on.... With this method I was able to create an
>> NER model for people's names against a 200K sentence corpus that returns
>> acceptable results just by starting with a list of five highly unambiguous
>> names. I will propose the framework in more detail in the coming days and
>> supply my impl if everyone is interested.
>> As for the initial question, I would like to see OpenNLP provide a
>> framework for rapidly/semi-automatically building models out of user data,
>> and also performing entity resolution across documents, in order to assign
>> a probability to whether the "Bob" in one document is the same as "Bob" in
>> another.
>>
>>
> Sounds very interesting. The sentence-wise training data which is produced
> this way could also be combined with existing training data, or just be
> used to bootstrap a model to more efficiently label data with a
> document-level annotation tool.
>
> Another aspect is that this tool might be good at detecting mistakes in
> existing training data.
>
> Jörn
>
>
>

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/02/2013 02:06 AM, Mark G wrote:
> I've been using OpenNLP for a few years and I find the best results occur
> when the models are generated using samples of the data they will be run
> against, one of the reasons I like the Maxent approach. I am not sure
> attempting to provide models will bear much fruit other than users will no
> longer be afraid of the licensing issues associated with using them in
> commercial systems. I do strongly think we should provide a modelbuilding
> framework (that calls the training api) and a default impl.
> Coincidentally....I have been building a framework and impl over the last
> few months that creates models based on seeding an iterative process with
> known entities and iterating through a set of supplied sentences to
> recursively create annotations, write them, create a maxentmodel, load the
> model, create more annotations based on the results (there is a validation
> object involved), and so on.... With this method I was able to create an
> NER model for people's names against a 200K sentence corpus that returns
> acceptable results just by starting with a list of five highly unambiguous
> names. I will propose the framework in more detail in the coming days and
> supply my impl if everyone is interested.
> As for the initial question, I would like to see OpenNLP provide a
> framework for rapidly/semi-automatically building models out of user data,
> and also performing entity resolution across documents, in order to assign
> a probability to whether the "Bob" in one document is the same as "Bob" in
> another.
>

Sounds very interesting. The sentence-wise training data which is produced
this way could also be combined with existing training data, or just be
used to bootstrap a model to more efficiently label data with a
document-level annotation tool.

Another aspect is that this tool might be good at detecting mistakes in
existing training data.
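
Combining the two could be as simple as concatenating the sample streams
before training, along these lines (a sketch; I believe ObjectStreamUtils
can concatenate streams like this, but treat the exact call as an
assumption):

import opennlp.tools.namefind.NameSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;

public class CombinedTrainingData {

  // Feed bootstrapped and hand-labeled samples to the trainer as one stream.
  static ObjectStream<NameSample> combine(
      ObjectStream<NameSample> bootstrapped,
      ObjectStream<NameSample> handLabeled) {
    return ObjectStreamUtils.createObjectStream(bootstrapped, handLabeled);
  }
}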

Jörn



Re: Next Steps for OpenNLP

Posted by Mark G <gi...@gmail.com>.
I've been using OpenNLP for a few years and I find the best results occur
when the models are generated using samples of the data they will be run
against, one of the reasons I like the Maxent approach. I am not sure
attempting to provide models will bear much fruit other than users will no
longer be afraid of the licensing issues associated with using them in
commercial systems. I do strongly think we should provide a modelbuilding
framework (that calls the training api) and a default impl.
Coincidentally....I have been building a framework and impl over the last
few months that creates models based on seeding an iterative process with
known entities and iterating through a set of supplied sentences to
recursively create annotations, write them, create a maxentmodel, load the
model, create more annotations based on the results (there is a validation
object involved), and so on.... With this method I was able to create an
NER model for people's names against a 200K sentence corpus that returns
acceptable results just by starting with a list of five highly unambiguous
names. I will propose the framework in more detail in the coming days and
supply my impl if everyone is interested.
As for the initial question, I would like to see OpenNLP provide a
framework for rapidly/semi-automatically building models out of user data,
and also performing entity resolution across documents, in order to assign
a probability to whether the "Bob" in one document is the same as "Bob" in
another.
MG


On Tue, Oct 1, 2013 at 11:01 AM, Michael Schmitz
<sc...@cs.washington.edu>wrote:

> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> tagger, and tokenizer.  We're grateful for a high performance library
> with an Apache license, but one of our greatest complaints is the
> quality of the models.  Yes--we're aware we can train our own--but
> most people are looking for something that is good enough out of the
> box (we aim for this with our products).  I'm not surprised that
> volunteer engineers don't want to spend their time annotating data ;-)
>
> I'm curious what other people see as the biggest shortcomings for
> OpenNLP or the most important next steps for OpenNLP.  I may have an
> opportunity to contribute to the project and I'm trying to figure out
> where the community thinks the biggest impact could be made.
>
> Peace.
> Michael Schmitz
>

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/01/2013 05:01 PM, Michael Schmitz wrote:
> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> tagger, and tokenizer.  We're grateful for a high performance library
> with an Apache license, but one of our greatest complaints is the
> quality of the models.  Yes--we're aware we can train our own--but
> most people are looking for something that is good enough out of the
> box (we aim for this with our products).  I'm not surprised that
> volunteer engineers don't want to spend their time annotating data ;-)

OpenNLP partly addressed this issue with the formats package; quite a few
existing corpora can now be used to create OpenNLP models. I personally
don't feel it's worth going through all the license issues to redistribute
these models, and rather think we should make sure they can be created
easily with the trainer tools we have.
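
As an example of that route, reading an existing corpus with one of the
opennlp.tools.formats readers and training from it directly (a sketch; the
reader constructor is from memory of the 1.5.x API, so treat the exact
signature as an assumption):

import java.io.FileInputStream;
import opennlp.tools.formats.Conll02NameSampleStream;
import opennlp.tools.formats.Conll02NameSampleStream.LANGUAGE;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;

public class TrainFromConll02 {

  public static void main(String[] args) throws Exception {
    // CoNLL-2002 Spanish NER data, person entities only (file is a placeholder).
    ObjectStream<NameSample> samples = new Conll02NameSampleStream(
        LANGUAGE.ES,
        new FileInputStream("esp.train"),
        Conll02NameSampleStream.GENERATE_PERSON_ENTITIES);
    TokenNameFinderModel model =
        NameFinderME.train("es", "person", samples, null);
    // serialize the model as usual ...
  }
}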

Jörn

Re: Next Steps for OpenNLP

Posted by Mark G <gi...@gmail.com>.
I've been using OpenNLP for a few years and I find the best results occur
when the models are generated using samples of the data they will be run
against, one of the reasons I like the Maxent approach. I am not sure
attempting to provide models will bear much fruit other than users will no
longer be afraid of the licensing issues associated with using them in
commercial systems. I do strongly think we should provide a modelbuilding
framework (that calls the training api) and a default impl.
Coincidentally....I have been building a framework and impl over the last
few months that creates models based on seeding an iterative process with
known entities and iterating through a set of supplied sentences to
recursively create annotations, write them, create a maxentmodel, load the
model, create more annotations based on the results (there is a validation
object involved), and so on.... With this method I was able to create an
NER model for people's names against a 200K sentence corpus that returns
acceptable results just by starting with a list of five highly unambiguous
names. I will propose the framework in more detail in the coming days and
supply my impl if everyone is interested.
As for the initial question, I would like to see OpenNLP provide a
framework for rapidly/semi-automatically building models out of user data,
and also performing entity resolution across documents, in order to assign
a probability to whether the "Bob" in one document is the same as "Bob" in
another.
MG


On Tue, Oct 1, 2013 at 11:01 AM, Michael Schmitz
<sc...@cs.washington.edu>wrote:

> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> tagger, and tokenizer.  We're grateful for a high performance library
> with an Apache license, but one of our greatest complaints is the
> quality of the models.  Yes--we're aware we can train our own--but
> most people are looking for something that is good enough out of the
> box (we aim for this with our products).  I'm not surprised that
> volunteer engineers don't want to spend their time annotating data ;-)
>
> I'm curious what other people see as the biggest shortcomings for
> OpenNLP or the most important next steps for OpenNLP.  I may have an
> opportunity to contribute to the project and I'm trying to figure out
> where the community thinks the biggest impact could be made.
>
> Peace.
> Michael Schmitz
>

Re: Next Steps for OpenNLP

Posted by John Stewart <ca...@gmail.com>.
This seems like a fairly big deal to me.  I've recently switched to using
Freeling in my dissertation work because of this -- I wanted to use the
same tool for a basic pipeline that gave me coverage of English and
Spanish.  Now, we could in principle use Freeling to generate a seed corpus
for other languages, then semi-supervise up from there, but we would then
also inherit Freeling's errors.

jds


On Wed, Oct 2, 2013 at 3:33 AM, Chris Collins <ch...@yahoo.com>wrote:

> I am going to make a really naive comment / idea / input.  (There are a
> lot of IFs in this post, so I apologize in advance.)
>
> It's my observation that there are lots of companies out there that really
> wouldn't mind OpenNLP having better coverage of POS tagging and chunking in
> a whole assortment of languages.  It's not a long-term competitive advantage
> to do it by themselves.  They also probably have neither the skills nor the
> time to make it happen (without pooling).  I have worked for three companies
> so far that fit into that category; at one I got very close to just paying
> for the labeling and donating the content.....clearly it didn't happen.
>
> Coverage in this part varies by company but
>
> As I see it:
>
> 1) Part of the problem is the labeling of the content.  What if we were
> able to turk this?  It may require breaking down the labeling process into
> a whole bunch of sub-tasks.  Further, it would probably require finding a
> subset of turkers capable of aiding in labeling for this type of advanced
> task.  I am a fan of companies like CrowdFlower that build on top of Amazon
> Mechanical Turk and have pre-validated turkers that are known to perform
> with certain task styles.
>
> 2) Assuming labeling to a quality (enough) level could be achieved with
> (1), could we have a fund / charity / kickstarter to pay for this labeling?
> Perhaps the funding is split up by language, so for instance companies
> could vote with their money on what they need to get fleshed out.
>
> Of course 1 + 2 don't solve the complete picture.
>
> Thoughts? Heckles?
>
> I actually work for a large corp that I can argue we need to put into the
> pot for several European languages and a couple of Asian ones.
>
> C
>
>
>
>
> On Oct 1, 2013, at 11:58 PM, Thomas Zastrow <po...@thomas-zastrow.de>
> wrote:
>
> > Dear all,
> >
> > Some of you already mentioned the Brat tool, so let me point you to
> > WebAnno. It is based on Brat, but has some more functionality, for
> > example extensions for crowdsourcing:
> >
> > http://code.google.com/p/webanno/
> >
> > Best,
> >
> > Tom
> >
> >
> >
> >
> > On 01.10.2013 17:01, Michael Schmitz wrote:
> >> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> >> tagger, and tokenizer.  We're grateful for a high performance library
> >> with an Apache license, but one of our greatest complaints is the
> >> quality of the models.  Yes--we're aware we can train our own--but
> >> most people are looking for something that is good enough out of the
> >> box (we aim for this with our products).  I'm not surprised that
> >> volunteer engineers don't want to spend their time annotating data ;-)
> >>
> >> I'm curious what other people see as the biggest shortcomings for
> >> OpenNLP or the most important next steps for OpenNLP.  I may have an
> >> opportunity to contribute to the project and I'm trying to figure out
> >> where the community thinks the biggest impact could be made.
> >>
> >> Peace.
> >> Michael Schmitz
> >
>
>

Re: Next Steps for OpenNLP

Posted by Chris Collins <ch...@yahoo.com>.
I am going to make a really naive comment / idea / input.  (There are a lot of IFs in this post, so I apologize in advance.)

It's my observation that there are lots of companies out there that really wouldn't mind OpenNLP having better coverage of POS tagging and chunking in a whole assortment of languages.  It's not a long-term competitive advantage to do it by themselves.  They also probably have neither the skills nor the time to make it happen (without pooling).  I have worked for three companies so far that fit into that category; at one I got very close to just paying for the labeling and donating the content.....clearly it didn't happen.

Coverage in this part varies by company but

As I see it:

1) Part of the problem is the labeling of the content.  What if we were able to turk this?  It may require breaking down the labeling process into a whole bunch of sub-tasks.  Further, it would probably require finding a subset of turkers capable of aiding in labeling for this type of advanced task.  I am a fan of companies like CrowdFlower that build on top of Amazon Mechanical Turk and have pre-validated turkers that are known to perform with certain task styles.

2) Assuming labeling to a quality (enough) level could be achieved with (1), could we have a fund / charity / kickstarter to pay for this labeling?  Perhaps the funding is split up by language, so for instance companies could vote with their money on what they need to get fleshed out.

Of course 1 + 2 don't solve the complete picture.

Thoughts? Heckles?

I actually work for a large corp that I can argue we need to put into the pot for several European languages and a couple of Asian ones.

C




On Oct 1, 2013, at 11:58 PM, Thomas Zastrow <po...@thomas-zastrow.de> wrote:

> Dear all,
> 
> Some of you already mentioned the Brat tool, so let me point you to WebAnno. It is based on Brat, but has some more functionality, for example extensions for crowdsourcing:
> 
> http://code.google.com/p/webanno/
> 
> Best,
> 
> Tom
> 
> 
> 
> 
> On 01.10.2013 17:01, Michael Schmitz wrote:
>> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
>> tagger, and tokenizer.  We're grateful for a high performance library
>> with an Apache license, but one of our greatest complaints is the
>> quality of the models.  Yes--we're aware we can train our own--but
>> most people are looking for something that is good enough out of the
>> box (we aim for this with our products).  I'm not surprised that
>> volunteer engineers don't want to spend their time annotating data ;-)
>> 
>> I'm curious what other people see as the biggest shortcomings for
>> OpenNLP or the most important next steps for OpenNLP.  I may have an
>> opportunity to contribute to the project and I'm trying to figure out
>> where the community thinks the biggest impact could be made.
>> 
>> Peace.
>> Michael Schmitz
> 


Re: Next Steps for OpenNLP

Posted by Thomas Zastrow <po...@thomas-zastrow.de>.
Dear all,

Some of you already mentioned the Brat tool, so let me point you to
WebAnno. It is based on Brat, but has some more functionality, for
example extensions for crowdsourcing:

http://code.google.com/p/webanno/

Best,

Tom




On 01.10.2013 17:01, Michael Schmitz wrote:
> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> tagger, and tokenizer.  We're grateful for a high performance library
> with an Apache license, but one of our greatest complaints is the
> quality of the models.  Yes--we're aware we can train our own--but
> most people are looking for something that is good enough out of the
> box (we aim for this with our products).  I'm not surprised that
> volunteer engineers don't want to spend their time annotating data ;-)
>
> I'm curious what other people see as the biggest shortcomings for
> OpenNLP or the most important next steps for OpenNLP.  I may have an
> opportunity to contribute to the project and I'm trying to figure out
> where the community thinks the biggest impact could be made.
>
> Peace.
> Michael Schmitz


Re: Next Steps for OpenNLP

Posted by Giorgio Valoti <gi...@me.com>.

> On 1 Oct 2013, at 21:48, Giorgio Valoti <gi...@me.com> wrote:
> 
> […]
> 
> I've read the docs about this aspect, i.e. about collaborating to annotate documents, a few months ago, but I don't remember the details. I hope to get back tomorrow with a brief comparison between GATE and brat. 

I have re-read the documentation about the corpus editing features in GATE. As far as I can see, GATE is more complete than brat, if perhaps over-engineered. I'm not saying brat isn't sufficient, nor that the quality of the comparable features is the same: brat may have less to offer, but it's more refined. 

Here's the link: http://gate.ac.uk/sale/tao/splitch23.html#x29-60300023

Hope this helps. 

--
Giorgio Valoti

Re: Next Steps for OpenNLP

Posted by Giorgio Valoti <gi...@me.com>.

> On 1 Oct 2013, at 21:31, Jörn Kottmann <ko...@gmail.com> wrote:
> 
>> On 10/01/2013 09:03 PM, Giorgio Valoti wrote:
>> 
>>>> On 1 Oct 2013, at 20:48, Jörn Kottmann <ko...@gmail.com> wrote:
>>>> 
>>>> On 10/01/2013 07:36 PM, Giorgio Valoti wrote:
>>>> I'd be interested in creating an annotated corpus for Italian or at least to begin the process. The first problem for me is finding an annotation guide. Does anyone have some links?
>>>> 
>>>> Re. brat. I have a little familiarity with GATE. How do they compare? Are they even comparable?
>>> The nice thing about wikinews is that it is available in many languages and we could probably develop some cross language
>>> principles.
>> What do you mean by "principles"? Are you talking about the guidelines?
> 
> Yes.
> 
>>> I never really used GATE, not sure how that could work out. Brat is web based and could easily be used by
>>> multiple users. I guess that would fit our needs quite well.
>>> 
>>> How is your experience with GATE?
>> (Please keep in mind that I'm no GATE expert.)
>> 
>> It's a very mature tool, with a bunch of features and a robust architecture. For example, I swapped out the default POS tagger with OpenNLP's in zero time. Nothing fancy, but it just worked.
> 
> We need a tool where a user can manually annotate/correct the data. How is your experience doing that with GATE?

I don't know :) Right now I'm working (well, I'm knee-deep in mud) to integrate/use GATE within a Clojure project. 

I've read the docs about this aspect, i.e. about collaborating to annotate documents, a few months ago, but I don't remember the details. I hope to get back tomorrow with a brief comparison between GATE and brat. 

--
Giorgio Valoti

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/01/2013 09:03 PM, Giorgio Valoti wrote:
>
>> On 1 Oct 2013, at 20:48, Jörn Kottmann <ko...@gmail.com> wrote:
>>
>>> On 10/01/2013 07:36 PM, Giorgio Valoti wrote:
>>> I'd be interested in creating an annotated corpus for Italian or at least to begin the process. The first problem for me is finding an annotation guide. Does anyone have some links?
>>>
>>> Re. brat. I have a little familiarity with GATE. How do they compare? Are they even comparable?
>> The nice thing about wikinews is that it is available in many languages and we could probably develop some cross language
>> principles.
> What do you mean by "principles"? Are you talking about the guidelines?

Yes.

>> I never really used GATE, not sure how that could work out. Brat is web based and could easily be used by
>> multiple users. I guess that would fit our needs quite well.
>>
>> How is your experience with GATE?
> (Please keep in mind that I'm no GATE expert.)
>
> It's a very mature tool, with a bunch of features and a robust architecture. For example, I swapped out the default POS tagger with OpenNLP's in zero time. Nothing fancy, but it just worked.

We need a tool where a user can manually annotate/correct the data. How 
is your experience doing that with GATE?

To create a corpus with a group of contributors, it must be easy to get 
started and then to contribute changes.

Jörn


Re: Next Steps for OpenNLP

Posted by Giorgio Valoti <gi...@me.com>.

> On 1 Oct 2013, at 20:48, Jörn Kottmann <ko...@gmail.com> wrote:
> 
>> On 10/01/2013 07:36 PM, Giorgio Valoti wrote:
>> I'd be interested in creating an annotated corpus for Italian or at least to begin the process. The first problem for me is finding an annotation guide. Does anyone have some links?
>> 
>> Re. brat. I have a little familiarity with GATE. How do they compare? Are they even comparable?
> 
> The nice thing about wikinews is that it is available in many languages and we could probably develop some cross language
> principles.

What do you mean by "principles"? Are you talking about the guidelines?

> 
> I never really used GATE, not sure how that could work out. Brat is web based and could easily be used by
> multiple users. I guess that would fit our needs quite well.
> 
> How is your experience with GATE?

(Please keep in mind that I'm no GATE expert.)

It's a very mature tool, with a bunch of features and a robust architecture. For example, I swapped out the default POS tagger with OpenNLP's in zero time. Nothing fancy, but it just worked. 

--
Giorgio Valoti

> 
> Jörn

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/01/2013 07:36 PM, Giorgio Valoti wrote:
> I'd be interested in creating an annotated corpus for Italian or at least to begin the process. The first problem for me is finding an annotation guide. Does anyone have some links?
>
> Re. brat. I have a little familiarity with GATE. How do they compare? Are they even comparable?

The nice thing about wikinews is that it is available in many languages 
and we could probably develop some cross language
principles.

I never really used GATE, not sure how that could work out. Brat is web 
based and could easily be used by
multiple users. I guess that would fit our needs quite well.

How is your experience with GATE?

Jörn

Re: Next Steps for OpenNLP

Posted by Giorgio Valoti <gi...@me.com>.
> On 1 Oct 2013, at 18:02, Jörn Kottmann <ko...@gmail.com> wrote:
> 
>> On 10/01/2013 05:36 PM, Ryan Josal wrote:
>> That is what I'm doing.  I've set up semaphore pools for all my TokenNameFinders.  I do wonder whether there's any technical concession one would have to make to keep a TokenNameFinder thread-safe.  What would happen to the adaptive data?  On the topic of models, the SourceForge ones have certainly been useful; I'm mainly using the NER models, but indeed more models, or models trained on more recent data, would be nice.  But I know training data, even without annotations, doesn't come out of thin air; otherwise I'd have created a few models myself.
> 
> If there is interest, and contributors, it would be possible to label wikinews data (we worked a bit on that), but surely there are more sources
> of documents that could be obtained under an Apache-compatible license.
> 
> Anyway, I guess the process to create training data as part of the OpenNLP project would be roughly as follows:
> - Obtain some raw text
> - Write an annotation guide (maybe based on some existing ones)
> - Agree on an annotation tool to use (e.g. brat)
> - Annotate a few hundred documents
> - Make the first release of the corpus

I'd be interested in creating an annotated corpus for Italian or at least to begin the process. The first problem for me is finding an annotation guide. Does anyone have some links?

Re. brat. I have a little familiarity with GATE. How do they compare? Are they even comparable?


Thanks
--
Giorgio Valoti


Re: Next Steps for OpenNLP

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hi,

It all depends on what you want to use OpenNLP for. If you want to
do basic research, you will still need to acquire the standard corpora
for a given task (e.g., the Penn Treebank for English parsing) if
you want to be able to compare with previous approaches and have
publishable results.

Cheers,

Rodrigo


On Tue, Oct 1, 2013 at 6:02 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 10/01/2013 05:36 PM, Ryan Josal wrote:
>>
>> That is what I'm doing.  I've set up semaphore pools for all my
>> TokenNameFinders.  I do wonder whether there's any technical concession
>> one would have to make to keep a TokenNameFinder thread-safe.  What
>> would happen to the adaptive data?  On the topic of models, the
>> SourceForge ones have certainly been useful; I'm mainly using the NER
>> models, but indeed more models, or models trained on more recent data,
>> would be nice.  But I know training data, even without annotations,
>> doesn't come out of thin air; otherwise I'd have created a few models
>> myself.
>
>
> If there is interest, and contributors, it would be possible to label
> wikinews data (we worked a bit on that), but surely there are more
> sources of documents that could be obtained under an Apache-compatible license.
>
> Anyway, I guess the process to create training data as part of the OpenNLP
> project would be roughly as follows:
> - Obtain some raw text
> - Write an annotation guide (maybe based on some existing ones)
> - Agree on an annotation tool to use (e.g. brat)
> - Annotate a few hundred documents
> - Make the first release of the corpus
>
> Jörn

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/01/2013 05:36 PM, Ryan Josal wrote:
> That is what I'm doing.  I've set up semaphore pools for all my TokenNameFinders.  I do wonder whether there's any technical concession one would have to make to keep a TokenNameFinder thread-safe.  What would happen to the adaptive data?  On the topic of models, the SourceForge ones have certainly been useful; I'm mainly using the NER models, but indeed more models, or models trained on more recent data, would be nice.  But I know training data, even without annotations, doesn't come out of thin air; otherwise I'd have created a few models myself.

If there is interest, and contributors, it would be possible to label 
wikinews data (we worked a bit on that), but surely there are more 
sources of documents that could be obtained under an Apache-compatible 
license.

Anyway, I guess the process to create training data as part of the 
OpenNLP project would be roughly as follows (with a sketch of the 
follow-on training step below):
- Obtain some raw text
- Write an annotation guide (maybe based on some existing ones)
- Agree on an annotation tool to use (e.g. brat)
- Annotate a few hundred documents
- Make the first release of the corpus
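
As a sketch of that follow-on training step, once such a corpus release 
exists: this is from memory against the 1.5.x training API, so please 
double-check the signatures for your version, and the file names are 
made up.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.charset.Charset;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainFromCorpus {
        public static void main(String[] args) throws Exception {
            // One sentence per line, entities marked inline, e.g.
            // "<START:person> Pierre Vinken <END> , 61 years old , ..."
            ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("wikinews-person.train"),
                Charset.forName("UTF-8"));
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            // 100 iterations and a feature cutoff of 5 are the usual defaults
            TokenNameFinderModel model = NameFinderME.train(
                "en", "person", samples,
                Collections.<String, Object>emptyMap(), 100, 5);
            samples.close();

            model.serialize(new FileOutputStream("en-ner-person.bin"));
        }
    }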

Jörn

Re: Next Steps for OpenNLP

Posted by Ryan Josal <ry...@josal.com>.
That is what I'm doing.  I've set up semaphore pools for all my TokenNameFinders.  I do wonder whether there's any technical concession one would have to make to keep a TokenNameFinder thread-safe.  What would happen to the adaptive data?  On the topic of models, the SourceForge ones have certainly been useful; I'm mainly using the NER models, but indeed more models, or models trained on more recent data, would be nice.  But I know training data, even without annotations, doesn't come out of thin air; otherwise I'd have created a few models myself.
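
Concretely, my pools are roughly this shape (just a sketch; 
NameFinderPool is my own wrapper, not an OpenNLP class, and a 
BlockingQueue stands in for the semaphore bookkeeping):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NameFinderPool {
        private final BlockingQueue<NameFinderME> pool;

        public NameFinderPool(TokenNameFinderModel model, int size) {
            pool = new ArrayBlockingQueue<NameFinderME>(size);
            for (int i = 0; i < size; i++) {
                pool.add(new NameFinderME(model)); // one finder per permit
            }
        }

        public Span[] find(String[] tokens) throws InterruptedException {
            NameFinderME finder = pool.take(); // blocks while all are in use
            try {
                return finder.find(tokens);
            } finally {
                finder.clearAdaptiveData(); // don't leak state across callers
                pool.add(finder);
            }
        }
    }

Whether to clear the adaptive data per call (as above) or per document 
is exactly my open question about it.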

Ryan

On Oct 1, 2013, at 8:29 AM, Jörn Kottmann wrote:

> On 10/01/2013 05:23 PM, Alexandre Patry wrote:
>> I would really like to see thread-safe models. Right now, using the same model in many threads requires making a copy for each thread. 
> 
> The models can be reused across threads, but you need to create a new instance of the component using the model.
> If there are any issues with that please report it as a bug.
> 
> Thanks,
> Jörn


Re: Next Steps for OpenNLP

Posted by Alexandre Patry <al...@nlpfu.com>.
On 2013-10-01 11:29, Jörn Kottmann wrote:
> On 10/01/2013 05:23 PM, Alexandre Patry wrote:
>> I would really like to see thread-safe models. Right now, using the 
>> same model in many threads requires making a copy for each thread.
> 
> The models can be reused across threads, but you need to create a new
> instance of the component using the model.
> If there are any issues with that please report it as a bug.

My bad, sorry about the false report but glad to learn that!

Alexandre

Re: Next Steps for OpenNLP

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/01/2013 05:23 PM, Alexandre Patry wrote:
> I would really like to see thread-safe models. Right now, using the 
> same model in many threads requires making a copy for each thread. 

The models can be reused across threads, but you need to create a new 
instance of the component using the model.
If there are any issues with that, please report it as a bug.
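
For illustration, a minimal sketch of that usage (the model file name 
is just an example; written against the 1.5.x API, so please check it 
against your version):

    import java.io.FileInputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;

    public class SharedModel {
        public static void main(String[] args) throws Exception {
            // Load once; the model object is immutable and can be shared.
            final TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("en-ner-person.bin"));

            Runnable task = new Runnable() {
                public void run() {
                    // One NameFinderME per thread: the finder keeps mutable
                    // state (the adaptive data) between calls.
                    NameFinderME finder = new NameFinderME(model);
                    finder.find(new String[] {"Mike", "Smith", "is", "here"});
                    finder.clearAdaptiveData(); // reset between documents
                }
            };

            new Thread(task).start();
            new Thread(task).start();
        }
    }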

Thanks,
Jörn

Re: Next Steps for OpenNLP

Posted by Alexandre Patry <al...@nlpfu.com>.
On 2013-10-01 11:01, Michael Schmitz wrote:
> I'm curious what other people see as the biggest shortcomings for
> OpenNLP or the most important next steps for OpenNLP.  I may have an
> opportunity to contribute to the project and I'm trying to figure out
> where the community thinks the biggest impact could be made.

I would really like to see thread-safe models. Right now, using the 
same model in many threads requires making a copy for each thread.

Best regards,

Alexandre Patry