Posted to dev@opennlp.apache.org by Jörn Kottmann <ko...@gmail.com> on 2011/06/07 16:26:23 UTC

OpenNLP Annotations Proposal

Hi all,

Based on some discussions we had in the past, I put together
a short proposal for a community-based labeling project.

Here is the link:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Any comments and opinions are very welcome.

Thanks,
Jörn

Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
2011/6/23 Jörn Kottmann <ko...@gmail.com>

> On 6/22/11 7:53 PM, Olivier Grisel wrote:
>
>>> We can also fix this by having an option to delete "garbage" texts
>>> from the corpus.
>>>
>> Yes, discarding a whole CAS. But if the CAS is document level instead
>> of sentence level, that might be an issue.
>>
>>
> It depends: if the whole article is in such bad condition that annotating
> it does not make sense, it should be discarded. If only a small part of
> the article cannot be annotated, the annotator can skip over this part.
>
>>> What other kind of data do you think we should store outside the CASes?
>>>
>> If we ignore the Sofa editing use case, probably nothing.
>>
> +1, to do that for now.
>
>
>>>> Also do you know of a good database for storing CAS? For instance does
>>>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>>>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>>>> Derby for instance?
>>>>
>>> I did a couple of tests with HBase and it was very easy to store 100M
>>> CASes; anyway, we do not really need to scale to such huge amounts, so
>>> I believe a NoSQL or relational database would be just fine.
>>>
>> I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
>> we target human annotators, we won't have terabytes of text data
>> anyway and all data will probably fit in memory in most cases. I was
>> thinking about using a DB to be able to handle concurrent editing by
>> several annotators (+ ability to do search in the Sofa content) in a
>> simple way.
>>
>
> Yeah, it does not seem important which DB we use, since most will
> just work well for us.
>
> I believe concurrent editing is more a question of the data model we
> choose, and to support search I would use something Lucene-based instead
> of the features some DBs might have.
>
> For training it is also important that we can iterate
> over all items in a reasonable time.
>
> I actually like BigTable's Column Family model because it is easy to
> store a Sofa plus feature structures in the columns, iterating is fast,
> and it can be scaled to huge amounts of data if needed.
>
> Anyway, maybe it would be good to start with Derby and just store XMI
> files in it. What do you think?
>

+1, at the moment it's not that important which storage solution to use, as
it can be improved once the basic functionality is finished.
I also imagine such a system with one CAS per doc with sentence-level
annotations.
Tommaso


>
> Jörn
>

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 7:53 PM, Olivier Grisel wrote:
>> We can also fix this by having an option to delete "garbage" texts from
>> the corpus.
> Yes, discarding a whole CAS. But if the CAS is document level instead
> of sentence level, that might be an issue.
>

It depends: if the whole article is in such bad condition that annotating
it does not make sense, it should be discarded. If only a small part of the
article cannot be annotated, the annotator can skip over this part.
>> What other kind of data do you think we should store outside the CASes?
> If we ignore the Sofa editing use case, probably nothing.
>
+1, to do that for now.

>>> Also do you know of a good database for storing CAS? For instance does
>>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>>> Derby for instance?
>> I did a couple of tests with HBase and it was very easy to store 100M
>> CASes; anyway, we do not really need to scale to such huge amounts, so I
>> believe a NoSQL or relational database would be just fine.
> I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
> we target human annotators, we won't have terabytes of text data
> anyway and all data will probably fit in memory in most cases. I was
> thinking about using a DB to be able to handle concurrent editing by
> several annotators (+ ability to do search in the Sofa content) in a
> simple way.

Yeah, it does not seem important which DB we use, since most will
just work well for us.

I believe concurrent editing is more a question of the data model we
choose, and to support search I would use something Lucene-based instead of
the features some DBs might have.

For training it is also important that we can iterate
over all items in a reasonable time.

I actually like BigTable's Column Family model because it is easy to store
a Sofa plus feature structures in the columns, iterating is fast, and it
can be scaled to huge amounts of data if needed.

Anyway, maybe it would be good to start with Derby and just store XMI files
in it. What do you think?

Jörn

Re: OpenNLP Annotations Proposal

Posted by Jason Baldridge <ja...@gmail.com>.
I defer to all of you on the specifics of the annotation infrastructure.
Great to see this moving forward!

One thing to throw in is that we may be able to take advantage of some
resources to bootstrap initial components. For example, I'm working with a
student to bootstrap multilingual POS taggers using Wiktionary as the tag
dictionary and a combination of label propagation and HMMs. This will have
lots of errors, but could be a useful starting point.

+1 for interest in Spark.

Jason


On Wed, Jun 22, 2011 at 2:59 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 6/22/11 8:13 PM, Hannes Korte wrote:
>
>> On 06/22/2011 07:53 PM, Olivier Grisel wrote:
>>
>>> 2011/6/22 Jörn Kottmann<ko...@gmail.com>:
>>>
>>>> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>>>
>>>>> I am ok with switching to UIMA CAS. We might need additional metadata
>>>>> outside of the CAS annotations though. For instance, if an annotator
>>>>> fixes a typo in the Sofa itself, we might need to be able to tell
>>>>> that Sofa1 is subject to being replaced by Sofa2 according to
>>>>> annotator A1 for instance.
>>>>>
>>>> I am not sure if we should fix such mistakes; the system will also
>>>> encounter them in real data it needs to process. Fixing typos, or
>>>> correcting things in the text, is always difficult when there are
>>>> already existing annotations.
>>>>
>>>> Do you feel fixing mistakes in the text is important?
>>>>
>>> We can leave that issue as a low priority discussion for later and
>>> just ignore it for now.
>>>
>>>
>>>> We can also fix this by having an option to delete "garbage" texts
>>>> from the corpus.
>>>>
>>> Yes, discarding a whole CAS. But if the CAS is document level instead
>>> of sentence level, that might be an issue.
>>>
>> Let's say we have a CAS type Sentence, which will not be changed, and
>> another type AnnotatedSentence. Each time a sentence was annotated by a
>> user, a new AnnotatedSentence annotation will be created in the same
>> span containing information about the user and the state of the sentence
>> (e.g. correct, unsure, or discarded). This way we can store all that
>> without the need for changes to the Sofa. Alternatively, each Sentence
>> could have a List of something like AnnotationMetadata.
>>
>
> The only reason to change a Sofa is when the user wants to change the text
> itself, right? How would the AnnotatedSentence annotation do that?
> Would it just store the changed text as a string feature?
>
>
>>>> I believe the Corpus Server should be independent of the other
>>>> components and define some kind of remote API for data interchange.
>>>>
>>> Is there a JSON version of XMI? Hannes, what is your opinion on this?
>>>
>> A separate corpus server sounds good to me. But this server can simply
>> deliver the default XMI representation of the CASes. I think the
>> documents have to be preprocessed for annotation on the server side of
>> the WebGUI anyways. The JS client should not call the corpus server
>> directly.
>>
> +1
>
> Jörn
>



-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 8:13 PM, Hannes Korte wrote:
> On 06/22/2011 07:53 PM, Olivier Grisel wrote:
>> 2011/6/22 Jörn Kottmann<ko...@gmail.com>:
>>> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>>> I am ok with switching to UIMA CAS. We might need additional metadata
>>>> outside of the CAS annotations though. For instance, if an annotator
>>>> fixes a typo in the Sofa itself, we might need to be able to tell
>>>> that Sofa1 is subject to being replaced by Sofa2 according to
>>>> annotator A1 for instance.
>>>>
>>> I am not sure if we should fix such mistakes; the system will also
>>> encounter them in real data it needs to process. Fixing typos, or
>>> correcting things in the text, is always difficult when there are
>>> already existing annotations.
>>>
>>> Do you feel fixing mistakes in the text is important?
>> We can leave that issue as a low priority discussion for later and
>> just ignore it for now.
>>
>>
>>> We can also fix this by having an option to delete "garbage" texts from
>>> the corpus.
>> Yes, discarding a whole CAS. But if the CAS is document level instead
>> of sentence level, that might be an issue.
> Let's say we have a CAS type Sentence, which will not be changed, and
> another type AnnotatedSentence. Each time a sentence was annotated by a
> user, a new AnnotatedSentence annotation will be created in the same
> span containing information about the user and the state of the sentence
> (e.g. correct, unsure, or discarded). This way we can store all that
> without the need for changes to the Sofa. Alternatively, each Sentence
> could have a List of something like AnnotationMetadata.

The only reason to change a Sofa is when the user wants to change the text
itself, right? How would the AnnotatedSentence annotation do that?
Would it just store the changed text as a string feature?

>>> I believe the Corpus server should be independent of the other components
>>> and define some kind of remote API for data interchange.
>> Is there a JSON version of XMI? Hannes, what is your opinion on this?
> A separate corpus server sounds good to me. But this server can simply
> deliver the default XMI representation of the CASes. I think the
> documents have to be preprocessed for annotation on the server side of
> the WebGUI anyways. The JS client should not call the corpus server
> directly.
+1

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
+1 for both comments.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 06/22/2011 07:53 PM, Olivier Grisel wrote:
> 2011/6/22 Jörn Kottmann <ko...@gmail.com>:
>> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>>
>>> I am ok with switching to UIMA CAS. We might need additional metadata
>>> outside of the CAS annotations though. For instance, if an annotator
>>> fixes a typo in the Sofa itself, we might need to be able to tell
>>> that Sofa1 is subject to being replaced by Sofa2 according to
>>> annotator A1 for instance.
>>>
>>
>> I am not sure if we should fix such mistakes; the system will also
>> encounter them in real data it needs to process. Fixing typos, or
>> correcting things in the text, is always difficult when there are
>> already existing annotations.
>>
>> Do you feel fixing mistakes in the text is important?
> 
> We can leave that issue as a low priority discussion for later and
> just ignore it for now.
> 
> 
>> We can also fix this by having an option to delete "garbage" texts from
>> the corpus.
> 
> Yes, discarding a whole CAS. But if the CAS is document level instead
> of sentence level, that might be an issue.

Let's say we have a CAS type Sentence, which will not be changed, and
another type AnnotatedSentence. Each time a sentence was annotated by a
user, a new AnnotatedSentence annotation will be created in the same
span containing information about the user and the state of the sentence
(e.g. correct, unsure, or discarded). This way we can store all that
without the need for changes to the Sofa. Alternatively, each Sentence
could have a List of something like AnnotationMetadata.
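
Just to make the idea concrete, a rough sketch against the plain UIMA CAS
API could look like this (untested; the type and feature names are only
placeholders, not a decided type system):

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class SentenceStatus {

    // Record one user's judgment of a sentence without touching the Sofa.
    public static void markSentence(CAS cas, int begin, int end,
            String user, String status) {
        Type t = cas.getTypeSystem().getType("org.example.AnnotatedSentence");
        Feature userFeat = t.getFeatureByBaseName("userId");
        Feature statusFeat = t.getFeatureByBaseName("status");
        AnnotationFS fs = cas.createAnnotation(t, begin, end);
        fs.setStringValue(userFeat, user);
        fs.setStringValue(statusFeat, status); // "correct", "unsure", "discarded"
        cas.addFsToIndexes(fs);
    }
}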

> ...

>> I believe the Corpus server should be independent of the other components
>> and define some kind of remote API for data interchange.
> 
> Is there a JSON version of XMI? Hannes, what is your opinion on this?

A separate corpus server sounds good to me. But this server can simply
deliver the default XMI representation of the CASes. I think the
documents have to be preprocessed for annotation on the server side of
the WebGUI anyways. The JS client should not call the corpus server
directly.

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Jörn Kottmann <ko...@gmail.com>:
> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>
>> I am ok with switching to UIMA CAS. We might need additional metadata
>> outside of the CAS annotations though. For instance, if an annotator
>> fixes a typo in the Sofa itself, we might need to be able to tell
>> that Sofa1 is subject to being replaced by Sofa2 according to
>> annotator A1 for instance.
>>
>
> I am not sure if we should fix such mistakes; the system will also
> encounter them in real data it needs to process. Fixing typos, or
> correcting things in the text, is always difficult when there are
> already existing annotations.
>
> Do you feel fixing mistakes in the text is important?

We can leave that issue as a low priority discussion for later and
just ignore it for now.


> We can also fix this by having an option to delete "garbage" texts from
> the corpus.

Yes, discarding a whole CAS. But if the CAS is document level instead
of sentence level, that might be an issue.

> What other kind of data do you think we should store outside the CASes?

If we ignore the Sofa editing use case, probably nothing.

>> Also do you know of a good database for storing CAS? For instance does
>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>> Derby for instance?
>
> I did a couple of tests with HBase and it was very easy to store 100M
> CASes; anyway, we do not really need to scale to such huge amounts, so I
> believe a NoSQL or relational database would be just fine.

I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
we target human annotators, we won't have terabytes of text data
anyway and all data will probably fit in memory in most cases. I was
thinking about using a DB to be able to handle concurrent editing by
several annotators (+ ability to do search in the Sofa content) in a
simple way.

> To get started I believe we should just store a CAS as XMI, and in a
> later stage we can work on optimizing the CAS storage to our needs and
> maybe even work together with the UIMA team on a more general corpus
> server; I know several people who have interest in this.

Alright. Let's use plain XMI files parsed and loaded in memory at the
beginning of the annotation session.
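
For reference, loading and saving plain XMI with the stock UIMA helpers is
only a few lines (a sketch; the file name is made up):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.XmiCasDeserializer;
import org.apache.uima.cas.impl.XmiCasSerializer;

public class XmiRoundTrip {

    // Load an XMI file into an existing CAS (its type system must match),
    // let the annotation session work on it in memory, then write it back.
    public static void roundTrip(CAS cas) throws Exception {
        XmiCasDeserializer.deserialize(
            new FileInputStream("article-42.xmi"), cas);
        // ... annotation session modifies the CAS here ...
        XmiCasSerializer.serialize(cas, new FileOutputStream("article-42.xmi"));
    }
}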

> I believe the Corpus server should be independent of the other components
> and define some kind of remote API for data interchange.

Is there a JSON version of XMI? Hannes, what is your opinion on this?

> If we define such an API, the actual storage system can be interchanged
> easily at a later point in time.

Ok.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 6:50 PM, Olivier Grisel wrote:
> I am ok with switching to UIMA CAS. We might need additional metadata
> outside of the CAS annotations though. For instance, if an annotator
> fixes a typo in the Sofa itself, we might need to be able to tell
> that Sofa1 is subject to being replaced by Sofa2 according to
> annotator A1 for instance.
>

I am not sure if we should fix such mistakes; the system will also
encounter them in real data it needs to process. Fixing typos, or
correcting things in the text, is always difficult when there are already
existing annotations.

Do you feel fixing mistakes in the text is important?

We can also fix this by having an option to delete "garbage" texts from
the corpus.

What other kind of data do you think we should store outside the CASes?
> Also do you know of a good database for storing CAS? For instance does
> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
> a JDBC CASConsumer + CollectionReader that we could use with Apache
> Derby for instance?

I did a couple of tests with HBase and it was very easy to store 100M
CASes; anyway, we do not really need to scale to such huge amounts, so I
believe a NoSQL or relational database would be just fine.

To get started I believe we should just store a CAS as XMI, and in a later
stage we can work on optimizing the CAS storage to our needs and maybe even
work together with the UIMA team on a more general corpus server; I know
several people who have interest in this.
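
A minimal sketch of what that could look like with embedded Derby (untested;
the table layout is invented for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class XmiStore {

    // One-time setup:
    //   CREATE TABLE cas_store (doc_id VARCHAR(128) PRIMARY KEY, xmi CLOB)
    public static Connection open() throws SQLException {
        return DriverManager.getConnection("jdbc:derby:corpus;create=true");
    }

    // Store the serialized XMI of one document under its id.
    public static void storeXmi(Connection con, String docId, String xmi)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "INSERT INTO cas_store (doc_id, xmi) VALUES (?, ?)");
        ps.setString(1, docId);
        ps.setString(2, xmi);
        ps.executeUpdate();
    }
}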

I believe the Corpus Server should be independent of the other components
and define some kind of remote API for data interchange.
If we define such an API, the actual storage system can be interchanged
easily at a later point in time.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Hannes Korte <ha...@iais.fraunhofer.de>:
> On 06/22/2011 06:50 PM, Olivier Grisel wrote:
>> I am ok with switching to UIMA CAS. We might need additional metadata
>> outside of the CAS annotations though. For instance, if an annotator
>> fixes a typo in the Sofa itself, we might need to be able to tell
>> that Sofa1 is subject to being replaced by Sofa2 according to
>> annotator A1 for instance.
>
> Do we have one CAS per sentence or one CAS per document? If the former
> is the case, then we will need some more metadata around the CAS
> documents to be able to show the context of a given sentence (if that is
> needed at all). If the latter is the case, then this will lead to many
> different Sofas, which only differ in a few characters, right?
>
> If we want to add disambiguation and coref information into the
> annotator UI at a later stage, then one CAS per document would be much
> more useful.

I am +1 for one CAS per document with intra-CAS fast navigation using
keyboard and filtered sentences at the UI level only. However, pignlproc
outputs OpenNLP-formatted sentences without document information. But this
can change (it's just not implemented yet).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Jörn Kottmann <ko...@gmail.com>:
>
>> Do we have one CAS per sentence or one CAS per document? If the former
>> is the case, then we will need some more metadata around the CAS
>> documents to be able to show the context of a given sentence (if that is
>> needed at all). If the latter is the case, then this will lead to many
>> different Sofas, which only differ in a few characters, right?
>>
>
> I was thinking about a system where we have one CAS per document,
> but our tooling should still collect annotations on a sentence level.
> So a user needs to annotate at least one sentence to add something
> useful to the CAS. The training code should then take care of training
> on a document which only contains a few annotated sentences.

I agree.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 7:38 PM, Hannes Korte wrote:
> On 06/22/2011 06:50 PM, Olivier Grisel wrote:
>> I am ok with switching to UIMA CAS. We might need additional metadata
>> outside of the CAS annotations though. For instance, if an annotator
>> fixes a typo in the Sofa itself, we might need to be able to tell
>> that Sofa1 is subject to being replaced by Sofa2 according to
>> annotator A1 for instance.
> Do we have one CAS per sentence or one CAS per document? If the former
> is the case, then we will need some more metadata around the CAS
> documents to be able to show the context of a given sentence (if that is
> needed at all). If the latter is the case, then this will lead to many
> different Sofas, which only differ in a few characters, right?
>

I was thinking about a system where we have one CAS per document,
but our tooling should still collect annotations on a sentence level.
So a user needs to annotate at least one sentence to add something
useful to the CAS. The training code should then take care of training
on a document which only contains a few annotated sentences.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 06/22/2011 06:50 PM, Olivier Grisel wrote:
> I am ok with switching to UIMA CAS. We might need additional metadata
> outside of the CAS annotations though. For instance, if an annotator
> fixes a typo in the Sofa itself, we might need to be able to tell
> that Sofa1 is subject to being replaced by Sofa2 according to
> annotator A1 for instance.

Do we have one CAS per sentence or one CAS per document? If the former
is the case, then we will need some more metadata around the CAS
documents to be able to show the context of a given sentence (if that is
needed at all). If the latter is the case, then this will lead to many
different Sofas, which only differ in a few characters, right?

If we want to add disambiguation and coref information into the
annotator UI at a later stage, then one CAS per document would be much
more useful.

Hannes

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Jörn Kottmann <ko...@gmail.com>:
> Any other opinions on how we should store/exchange our
> text with annotations?
>
> As proposed up to now:
> 1. UIMA CAS based approach
> 2. Custom solution as proposed by Olivier

I am ok with switching to UIMA CAS. We might need additional metadata
outside of the CAS annotations though. For instance, if an annotator
fixes a typo in the Sofa itself, we might need to be able to tell
that Sofa1 is subject to being replaced by Sofa2 according to
annotator A1 for instance.

Also do you know of a good database for storing CAS? For instance does
there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
a JDBC CASConsumer + CollectionReader that we could use with Apache
Derby for instance?

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
Any other opinions on how we should store/exchange our
text with annotations?

As proposed up to now:
1. UIMA CAS based approach
2. Custom solution as proposed by Olivier

I think we should reach consensus here quickly
so we can start extending the proposal.

And if there are no objections I suggest that we include
the Corpus Refiner in the proposal as a web-based tool
to update/verify/annotate a corpus.

Jörn

On 6/22/11 11:38 AM, Olivier Grisel wrote:
> 2011/6/22 Jörn Kottmann<ko...@gmail.com>:
>> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>> I find the UIMA CAS API much more complicated to work with than
>>> directly working with token-level concepts with the OpenNLP API (i.e.
>>> with arrays of Span). I haven't had a look at the opennlp-uima
>>> subproject though: you probably already have tooling and predefined
>>> type systems that make interoperability with CAS instances less of a
>>> pain.
>> If you look at annotation tools, they usually give some flexibility to
>> the user in terms of what kind of annotations they are allowed to add.
>> One thing I always see is that as soon as they allow more complex
>> annotations, the tools and code which handle the annotations also get
>> complex. Have a look at Wordfreak or Gate.
>>
>> The CAS might be difficult to use at first, but at least it works and is
>> very well tested. If we create a custom solution we might end up with
>> a similar complexity anyway.
>>
>> We would need to define a type system, but that is something we need
>> to do anyway independent of which way we implement it.
>> Maybe we even need to support different type systems for different corpora.
>> I guess we start with Wikipedia-based data, but one day we might want to
>> annotate an email or blog corpus.
>>
>> It is an interesting question how the type system should look, since we
>> need to track where the annotations come from, and might even want some
>> to be double-checked, or need to annotate the disagreement of annotators.
> Point taken.
>


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Jörn Kottmann <ko...@gmail.com>:
> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>
>> I find the UIMA CAS API much more complicated to work with than
>> directly working with token-level concepts with the OpenNLP API (i.e.
>> with arrays of Span). I haven't had a look at the opennlp-uima
>> subproject though: you probably already have tooling and predefined
>> type systems that make interoperability with CAS instances less of a
>> pain.
>
> If you look at annotation tools, they usually give some flexibility to
> the user in terms of what kind of annotations they are allowed to add.
> One thing I always see is that as soon as they allow more complex
> annotations, the tools and code which handle the annotations also get
> complex. Have a look at Wordfreak or Gate.
>
> The CAS might be difficult to use at first, but at least it works and is
> very well tested. If we create a custom solution we might end up with
> a similar complexity anyway.
>
> We would need to define a type system, but that is something we need
> to do anyway independent of which way we implement it.
> Maybe we even need to support different type systems for different corpora.
> I guess we start with Wikipedia-based data, but one day we might want to
> annotate an email or blog corpus.
>
> It is an interesting question how the type system should look, since we
> need to track where the annotations come from, and might even want some
> to be double-checked, or need to annotate the disagreement of annotators.

Point taken.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 10:45 AM, Olivier Grisel wrote:
> I find the UIMA CAS API much more complicated to work with than
> directly working with token-level concepts with the OpenNLP API (i.e.
> with arrays of Span). I haven't had a look at the opennlp-uima
> subproject though: you probably already have tooling and predefined
> type systems that make interoperability with CAS instances less of a
> pain.

If you look at annotation tools, they usually give some flexibility to the
user in terms of what kind of annotations they are allowed to add. One
thing I always see is that as soon as they allow more complex annotations,
the tools and code which handle the annotations also get complex. Have a
look at Wordfreak or Gate.

The CAS might be difficult to use at first, but at least it works and is
very well tested. If we create a custom solution we might end up with
a similar complexity anyway.

We would need to define a type system, but that is something we need
to do anyway independent of which way we implement it.
Maybe we even need to support different type systems for different corpora.
I guess we start with Wikipedia-based data, but one day we might want to
annotate an email or blog corpus.

It is an interesting question how the type system should look, since we
need to track where the annotations come from, and might even want some to
be double-checked, or need to annotate the disagreement of annotators.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
> On 24.06.2011 10:03, Jörn Kottmann wrote:
>> Hannes and Olivier, do you want to take over the part about the
>> web-based annotation tooling? I called it Corpus Refiner for now, but we
>> can of course change the name to something else.
>
> Yes, I'll try to find some time in the next few days to have a look at
> what Olivier already committed and to work on the JavaScript part of
> the web GUI.

Then we should soon start a vote to accept it as a contribution and move 
it over to our sandbox.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 7:42 PM, Hannes Korte wrote:
> On 06/24/2011 06:37 PM, Jörn Kottmann wrote:
> You mean we take the user annotations above a certain agreement level
> from the first class types to the second class types to get the gold
> annotations? For entities this is no problem, but where do we start for
> tokens and sentences? I think we initially apply the current OpenNLP
> sentence splitter and tokenizer, right?
>
Exactly. For sentences we have a special annotation to label end-of-sentence
characters as split or not split. And we do the same for tokens, but there
the split annotation has a length of zero. The users can then vote on these
annotations. Since it's a binary decision, it is either true or false.
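
From the split annotations the concrete token spans can then be derived
mechanically; roughly like this (a sketch only, whitespace handling and
validation are left out):

import java.util.ArrayList;
import java.util.List;
import opennlp.tools.util.Span;

public class SplitsToTokens {

    // Each zero-length split annotation ends one token and starts the next.
    // splitOffsets must be sorted character offsets into the text.
    public static List<Span> tokensFromSplits(String text, int[] splitOffsets) {
        List<Span> tokens = new ArrayList<Span>();
        int start = 0;
        for (int offset : splitOffsets) {
            if (offset > start) {
                tokens.add(new Span(start, offset));
            }
            start = offset;
        }
        if (start < text.length()) {
            tokens.add(new Span(start, text.length()));
        }
        return tokens;
    }
}
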
>> For example, we could ask the annotators to label token splits; from
>> these token splits we can derive the actual token annotations. For
>> English texts the annotation UI could make use of the alphanumeric
>> optimization and only ask the user for questionable token splits.
> Ok, so similar to the entities the UI needs to show the token boundaries
> as well as functionality to change these. Or do you want this
> functionality in a different UI than the named entity one?

I am not sure, maybe we should just try something and then refine it after
a little experimenting. I think both would work in the beginning.

I guess we do best when we only hand out articles with high-quality
tokenization to tasks which depend on it. So maybe it would be good to have
some UI to quickly confirm that the tokenization is ok.

>> For named entity annotations the user could do BIO-style token
>> labeling through a special UI, similar to the one in Walter. The BIO
>> labels can then be used to compute the name spans.
> Until the beginning of this post I thought we use the name spans to
> compute the BIO labels not the other way round. But if we show the
> tokens as single blocks, then it makes sense to use some sort of
> BIO-style annotations.
>
> For example, the user navigates over the tokens with the left and right
> arrow keys. If he hits "P" (for "B-PER") then the focus moves to the
> next token. Hitting "p" marks it as "I-PER", hitting "P" another time
> marks it as a new entity ("B-PER") and hitting "space" marks it as "O",
> i.e., removing a previous annotation. The arrow keys don't change the
> label. Feels pretty usable in my mind.. :)

Yes, but the labeling UI itself can also use other methods, e.g. confirm
existing entities and then confirm the entire sentence; the UI code can
simply transform this into BIO-style annotations, and the UI will be able
to offer a veto with a comment for a single token. Maybe we decide to label
person names without a title in front, for example Mr. *Smith*, but now
someone labels it as *Mr. Smith*; then a user vetoes the annotation on Mr.
and inserts a short comment.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 06/24/2011 06:37 PM, Jörn Kottmann wrote:
> I suggest that there are two classes of types in the type system.
> 
> The first class contains annotations which describe the input we
> collect from our annotators and are also suitable to document
> comments and disagreements between annotators.
> 
> And the second class of annotations contains standard linguistic
> annotations such as sentences, tokens, entities, chunks, parses,
> etc.

+1

> The idea is that the annotations in the second class can be automatically
> derived from the annotations in the first class. In case the article is
> not completely labeled, the statistical models could fill the gap.

You mean we take the user annotations above a certain agreement level
from the first class types to the second class types to get the gold
annotations? For entities this is no problem, but where do we start for
tokens and sentences? I think we initially apply the current OpenNLP
sentence splitter and tokenizer, right?

> For example, we could ask the annotators to label token splits; from
> these token splits we can derive the actual token annotations. For
> English texts the annotation UI could make use of the alphanumeric
> optimization and only ask the user for questionable token splits.

Ok, so similar to the entities the UI needs to show the token boundaries
as well as functionality to change these. Or do you want this
functionality in a different UI than the named entity one?

> For named entity annotations the user could do BIO-style token
> labeling through a special UI, similar to the one in Walter. The BIO
> labels can then be used to compute the name spans.

Until the beginning of this post I thought we use the name spans to
compute the BIO labels not the other way round. But if we show the
tokens as single blocks, then it makes sense to use some sort of
BIO-style annotations.

For example, the user navigates over the tokens with the left and right
arrow keys. If he hits "P" (for "B-PER") then the focus moves to the
next token. Hitting "p" marks it as "I-PER", hitting "P" another time
marks it as a new entity ("B-PER") and hitting "space" marks it as "O",
i.e., removing a previous annotation. The arrow keys don't change the
label. Feels pretty usable in my mind.. :)

Hannes

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 7:32 PM, Hannes Korte wrote:
> How about a button like "Incomplete sentence" in the entity UI. When the
> user hits it, he/she gets the context and can select the complete
> sentence. I guess this will get really complicated to merge all the
> different user annotations then. But at least we don't need an
> additional annotation task for the sentence labeling.

For sentence information we could have an annotation which marks
end-of-sentence characters. If a user now finds an invalid sentence, he can
insert such an annotation and the generated sentence annotations can be
corrected.

It does not really matter where this information comes from; I first
thought it might be nice to have a dedicated UI for this. But it could also
be part of a UI to label entities, as suggested by Hannes, and the created
annotation would still be the same.

The entity annotation itself could also be treated as a confirmation that
something is not an end-of-sentence character. Let's take Yahoo! for
example: if it is labeled as an organization and as one token, we should
annotate the exclamation mark as an end-of-sentence character which is not
an actual sentence end.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 06/24/2011 06:55 PM, Jörn Kottmann wrote:
> On 6/24/11 6:42 PM, Olivier Grisel wrote:
>> I like the ability to move the UI focus from one sentence to
>> another and being able to mark a complete sentence as validated. +1
>> for the rest of your proposal.
> Sure, sentence annotations are available, but they also need to be
> annotated to some degree, or a user must be able to correct them. Some
> UI to label the sentence information is needed for that.
> 
> The UIs for the other tasks such as named entity labeling and POS
> labeling will be able to use the sentence annotations.
> 
> For example the named entity labeling UI can be operated on a
> sentence level and the user can mark spans which contain entities, or
> confirm existing entities. Or maybe just confirm the entire sentence;
> in case the user confirms the entire sentence we could just update
> all tokens in the sentence, or add a special annotation to track
> this. But maybe it is easier to just confirm all tokens in the
> sentence, because then there is no issue if the sentence annotation
> itself is corrected.

How about a button like "Incomplete sentence" in the entity UI? When the
user hits it, he/she gets the context and can select the complete
sentence. I guess this will get really complicated to merge all the
different user annotations then. But at least we don't need an
additional annotation task for the sentence labeling.

Hannes

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 6:42 PM, Olivier Grisel wrote:
> I like the ability to move the UI focus from one sentence to another
> and being able to mark a complete sentence as validated. +1 for the
> rest of your proposal.
Sure, sentence annotations are available, but they also need to be annotated
to some degree, or a user must be able to correct them.
Some UI to label the sentence information is needed for that.

The UIs for the other tasks such as named entity labeling and POS labeling
will be able to use the sentence annotations.

For example the named entity labeling UI can be operated on a sentence
level, and the user can mark spans which contain entities, or confirm
existing entities. Or maybe just confirm the entire sentence; in case the
user confirms the entire sentence we could just update all tokens in the
sentence, or add a special annotation to track this. But maybe it is easier
to just confirm all tokens in the sentence, because then there is no issue
if the sentence annotation itself is corrected.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/24 Jörn Kottmann <ko...@gmail.com>:
> On 6/24/11 11:54 AM, Olivier Grisel wrote:
>>
>> but we need to agree on a CAS type system first. I don't
>> know the opennlp-uima myself and won't have time to invest more effort
>> on this project before mid-july unfortunately.
>
> I suggest that there are two classes of types in the type system.
>
> The first class contains annotations which describe the input we collect
> from our annotators and are also suitable to document comments and
> disagreements between annotators.
>
> And the second class of annotations contains standard linguistic annotations
> such as sentences, tokens, entities, chunks, parses, etc.
>
> The idea is that the annotations in the second class can be automatically
> derived from the annotations in the first class. In case the article is
> not completely labeled, the statistical models could fill the gap.
>
> For example, we could ask the annotators to label token splits; from these
> token splits we can derive the actual token annotations. For English texts
> the annotation UI could make use of the alphanumeric optimization and only
> ask the user for questionable token splits.
>
> A similar approach could be done for sentence annotations.
>
> For named entity annotations the user could do BIO-style token labeling
> through a special UI, similar to the one in Walter. The BIO labels can
> then be used to compute the name spans.
>
> Our models can either be trained directly on the derived annotations, or
> we add a sentence-level annotation where users need to confirm that the
> entire sentence is labeled correctly, for example that all person
> annotations are marked in this sentence.

I like the ability to move the UI focus from one sentence to another
and being able to mark a complete sentence as validated. +1 for the
rest of your proposal.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 11:54 AM, Olivier Grisel wrote:
> but we need to agree on a CAS type system first. I don't
> know the opennlp-uima myself and won't have time to invest more effort
> on this project before mid-july unfortunately.

I suggest that there are two classes of types in the type system.

The first class contains annotations which describe the input we collect
from our annotators and are also suitable to document comments and
disagreements between annotators.

And the second class of annotations contains standard linguistic annotations
such as sentences, tokens, entities, chunks, parses, etc.

The idea is that the annotations in the second class can be automatically
derived from the annotations in the first class. In case the article is not
completely labeled, the statistical models could fill the gap.

For example, we could ask the annotators to label token splits; from these
token splits we can derive the actual token annotations. For English texts
the annotation UI could make use of the alphanumeric optimization and only
ask the user for questionable token splits.

A similar approach could be done for sentence annotations.

For named entity annotations the user could do BIO-style token labeling
through a special UI, similar to the one in Walter. The BIO labels can then
be used to compute the name spans.
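
The span computation itself is mechanical; a rough sketch (not the actual
tooling, just to illustrate the derivation):

import java.util.ArrayList;
import java.util.List;
import opennlp.tools.util.Span;

public class BioToSpans {

    // Turn per-token BIO labels (e.g. "B-PER", "I-PER", "O") into name
    // spans in token indices. Dangling "I-" labels without a preceding
    // "B-" are simply ignored here.
    public static List<Span> bioToSpans(String[] labels) {
        List<Span> names = new ArrayList<Span>();
        int start = -1;
        String type = null;
        for (int i = 0; i < labels.length; i++) {
            boolean begin = labels[i].startsWith("B-");
            boolean inside = labels[i].startsWith("I-");
            if (start != -1 && (begin || !inside)) {
                names.add(new Span(start, i, type)); // current name ends
                start = -1;
            }
            if (begin) {
                start = i;
                type = labels[i].substring(2);
            }
        }
        if (start != -1) {
            names.add(new Span(start, labels.length, type));
        }
        return names;
    }
}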

Our models can either be trained directly on the derived annotations, or we
add a sentence-level annotation where users need to confirm that the entire
sentence is labeled correctly, for example that all person annotations are
marked in this sentence.

Any opinions?

Jörn

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/25/11 12:57 AM, Olivier Grisel wrote:
> +1 for a simple RESTful service with JAX-RS resources. Jersey is a good
> implementation.
>
I am also +1 for a simple RESTful service.
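
Just to sketch the shape of it, a first resource could look roughly like
this (the paths and the CorpusStore helper are invented, not the proposed
interface):

import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

// Hypothetical JAX-RS resource: fetch and update a CAS as XMI over HTTP.
@Path("/corpus/{corpusId}/cas/{casId}")
public class CasResource {

    @GET
    @Produces("application/xml")
    public String getCas(@PathParam("corpusId") String corpus,
            @PathParam("casId") String cas) {
        return CorpusStore.get(corpus).fetchXmi(cas); // invented store API
    }

    @PUT
    @Consumes("application/xml")
    public void updateCas(@PathParam("corpusId") String corpus,
            @PathParam("casId") String cas, String xmi) {
        CorpusStore.get(corpus).storeXmi(cas, xmi); // invented store API
    }
}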

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
+1 for a simple RESTful service with JAX-RS resources. Jersey is a good
implementation.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Henry Saputra <he...@gmail.com>.
I would think RESTful services should work.

Most programming languages and frameworks have built-in HTTP support to
help submit and process HTTP requests.

- Henry

On Fri, Jun 24, 2011 at 3:16 PM, Hannes Korte
<ha...@iais.fraunhofer.de> wrote:
> On 24.06.2011 23:50, Jörn Kottmann wrote:
>>>
>>> Ok, till then I'll work with some dummy documents. By the way, who wants
>>> to build the active learning component, which decides which sentences
>>> come next?
>>
>> I added a description about the initial interface the Corpus Server
>> could have to the proposal, please review.
>>
>> Are there any opinions which technology we should use for remote
>> interface?
>>
>> For example:
>> * RESTful service
>> * Web Service
>> * Thrift
>>
>> Maybe we should go ahead and make a dummy version of this server so
>> you have something to work with.
>
> That would be great. I thought of a simple RESTful service with Jersey, but
> Thrift looks interesting, too. If you are already familiar with it, I'd like
> to get introduced to it as the tutorial is rather incomplete.. :)
> http://thrift.apache.org/tutorial/#t5
>
> Hannes
>

Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
2011/6/30 Olivier Grisel <ol...@ensta.org>

> 2011/6/30 Tommaso Teofili <to...@gmail.com>:
> > 2011/6/29 Jörn Kottmann <ko...@gmail.com>
> >
> >> On 6/29/11 3:51 PM, Tommaso Teofili wrote:
> >>
> >>> I think I can better help with the Corpus Server, as I've also some
> >>> experience with Lucene (by the way, I imagine Lucas could be used to
> >>> save CASes inside the index). I think I can help with CAS searching
> >>> and task queueing (with a UIMA AAE process).
> >>>
> >>
> >> +1, I am pretty sure we can reuse Lucas, or eventually adapt it a
> >> little to be suitable for our needs. We need to have one index loaded
> >> and concurrently update and query it. Maybe we need to modify Lucas a
> >> little to give us a reference to the index it is writing the CASes to.
> >>
> >
> > Right, I will start inspecting whether it's worth modifying it or
> > writing a new component; it may also be useful to switch it to the
> > latest Lucene (as it's on 2.9.3 now).
>
> Why not switch directly to Lucene 3.2?
>
>
I meant that Lucas is using Lucene 2.9.3 and it'd be good if we could move
Lucas to the latest Lucene (by the way, I think 3.3 should be out in a few
hours).

Tommaso


>  --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/30 Tommaso Teofili <to...@gmail.com>:
> 2011/6/29 Jörn Kottmann <ko...@gmail.com>
>
>> On 6/29/11 3:51 PM, Tommaso Teofili wrote:
>>
>>> I think I can better help with the Corpus Server, as I've also some
>>> experience with Lucene (by the way, I imagine Lucas could be used to save
>>> CASes inside the index). I think I can help with CAS searching and task
>>> queueing (with a UIMA AAE process).
>>>
>>
>> +1, I am pretty sure we can reuse Lucas, or eventually adapt it a little
>> to be suitable for our needs. We need to have one index loaded and
>> concurrently update and query it. Maybe we need to modify Lucas a little
>> to give us a reference to the index it is writing the CASes to.
>>
>
> Right, I will start inspecting whether it's worth modifying it or writing
> a new component; it may also be useful to switch it to the latest Lucene
> (as it's on 2.9.3 now).

Why not switch directly to Lucene 3.2?

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/30/11 2:58 PM, Tommaso Teofili wrote:
> Regarding Apache Derby "CorpusStore" I think it'd be good to create a
> CorpusStore interface so that implementations can be created/switched easily
> in the future.

We decided on the list to use something which is easy to handle for our
OpenNLP Annotations project. So that is the motivation behind this choice.

+1, in other projects people might use different storage systems

Jörn

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/30/11 2:58 PM, Tommaso Teofili wrote:
>> >  I hope the required changes to Lucas are not that big, it already contains
>> >  a lot of functionality we need, for example Type System to Lucene Index
>> >  mapping.
>> >
>> >
>> >
> I tried to have a look some weeks ago; the most important change is in the
> new declarative Token API used in Lucene 3+. However, it should not be so
> hard.

I had a quick look at the code, and I think we can extend LuceneDocumentAE;
this way we can control the index writing. But maybe that is not even
necessary, since the index is written to disk and an IndexReader can
concurrently access it.

The javadoc of two classes explains the index handling:
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/index/IndexWriter.html
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/index/IndexReader.html

Jörn

Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
2011/6/30 Jörn Kottmann <ko...@gmail.com>

> On 6/30/11 12:18 PM, Tommaso Teofili wrote:
>
>> 2011/6/29 Jörn Kottmann<ko...@gmail.com>
>>
>>  On 6/29/11 3:51 PM, Tommaso Teofili wrote:
>>>
>>>  I think I can better help with the Corpus Server, as I've also some
>>>> experience with Lucene (by the way, I imagine Lucas could be used to
>>>> save CASes inside the index). I think I can help with CAS searching
>>>> and task queueing (with a UIMA AAE process).
>>>>
>>> +1, I am pretty sure we can reuse Lucas, or eventually adapt it a
>>> little to be suitable for our needs. We need to have one index loaded
>>> and concurrently update and query it. Maybe we need to modify Lucas a
>>> little to give us a reference to the index it is writing the CASes to.
>>>
>> Right, I will start inspecting whether it's worth modifying it or
>> writing a new component; it may also be useful to switch it to the
>> latest Lucene (as it's on 2.9.3 now).
>>
>>
> I hope the required changes to Lucas are not that big, it already contains
> a lot of functionality we need, for example Type System to Lucene Index
> mapping.
>
>
>
I tried to have a look some weeks ago; the most important change is in the
new declarative Token API used in Lucene 3+. However, it should not be so
hard.
Regarding Apache Derby "CorpusStore" I think it'd be good to create a
CorpusStore interface so that implementations can be created/switched easily
in the future.
Tommaso

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/30/11 12:18 PM, Tommaso Teofili wrote:
> 2011/6/29 Jörn Kottmann<ko...@gmail.com>
>
>> On 6/29/11 3:51 PM, Tommaso Teofili wrote:
>>
>>> I think I can better help with the Corpus Server, as I've also some
>>> experience with Lucene (by the way, I imagine Lucas could be used to save
>>> CASes inside the index). I think I can help with CAS searching and task
>>> queueing (with a UIMA AAE process).
>>>
>> +1, I am pretty sure we can reuse Lucas, or eventually adapt it a little
>> to be suitable for our needs. We need to have one index loaded and
>> concurrently update and query it. Maybe we need to modify Lucas a little
>> to give us a reference to the index it is writing the CASes to.
>>
> Right, I will start inspecting whether it's worth modifying it or writing
> a new component; it may also be useful to switch it to the latest Lucene
> (as it's on 2.9.3 now).
>

I hope the required changes to Lucas are not that big; it already contains
a lot of functionality we need, for example Type System to Lucene Index
mapping.

Jörn


Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
2011/6/29 Jörn Kottmann <ko...@gmail.com>

> On 6/29/11 3:51 PM, Tommaso Teofili wrote:
>
>> I think I can better help with the Corpus Server, as I've also some
>> experience with Lucene (by the way, I imagine Lucas could be used to save
>> CASes inside the index). I think I can help with CAS searching and task
>> queueing (with a UIMA AAE process).
>>
>
> +1, I am pretty sure we can reuse Lucas, or eventually adapt it a little
> to be suitable for our needs. We need to have one index loaded and
> concurrently update and query it. Maybe we need to modify Lucas a little
> to give us a reference to the index it is writing the CASes to.
>

Right, I will start inspecting whether it's worth modifying it or writing
a new component; it may also be useful to switch it to the latest Lucene
(as it's on 2.9.3 now).


>
> The index must be updated when a CAS is added and when a CAS is changed.
> That should be simple to do. Then we have a search method which returns a
> list of matched CAS references; that should also be easy to implement with
> Lucene APIs.
>
> It would be nice if you could open a JIRA for this, and then attach a patch.
>
> To implement a task queue I think we should use a DB table to keep track of
> what should be handed out, and what was already sent to a client. In case
> an item is not returned in time, we might need to reschedule it.
>
> I think it would be good to create three jiras:
> - one to add search support based on Lucas
> - one to use Derby for CAS persistence instead of the simple java.util.Map
> used by the dummy
> - and one issue to add support to create a task queue
>
> What do you think?
>

It sounds good, I'll open such JIRAs.
Regards,
Tommaso

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/29/11 3:51 PM, Tommaso Teofili wrote:
> I think I can better help with the Corpus Server, as I've also some
> experience with Lucene (by the way, I imagine Lucas could be used to save
> CASes inside the index). I think I can help with CAS searching and task
> queueing (with a UIMA AAE process).

+1, I am pretty sure we can reuse Lucas, or eventually adapt it a little to
be suitable for our needs. We need to have one index loaded and concurrently
update and query it. Maybe we need to modify Lucas a little to give us a
reference to the index it is writing the CASes to.

The index must be updated when a CAS is added and when a CAS is changed.
That should be simple to do. Then we have a search method which returns a
list of matched CAS references; that should also be easy to implement with
Lucene APIs.
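
With Lucene 3.x (assuming the Lucas upgrade discussed on this thread) the
update and search parts could look roughly like this; the field names are
made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class CasIndex {

    // One Lucene document per CAS; updateDocument replaces the old version
    // when a CAS changes, keyed by its id.
    public static void indexCas(Directory dir, String casId, String sofaText)
            throws Exception {
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
            Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33)));
        Document doc = new Document();
        doc.add(new Field("casId", casId,
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("sofa", sofaText,
            Field.Store.NO, Field.Index.ANALYZED));
        writer.updateDocument(new Term("casId", casId), doc);
        writer.close();
    }

    // The search method returns the ids of the matching CASes.
    public static String[] search(Directory dir, String term) throws Exception {
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        ScoreDoc[] hits = searcher.search(
            new TermQuery(new Term("sofa", term)), 100).scoreDocs;
        String[] ids = new String[hits.length];
        for (int i = 0; i < hits.length; i++) {
            ids[i] = searcher.doc(hits[i].doc).get("casId");
        }
        searcher.close();
        return ids;
    }
}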

It would be nice if you could open a JIRA for this, and then attach a patch.

To implement a task queue I think we should use a DB table to keep track of
what should be handed out, and what was already sent to a client. In case an
item is not returned in time, we might need to reschedule it.
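
The rescheduling part is then a single statement; for example (an invented
table layout, embedded Derby):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class TaskQueue {

    // One-time setup:
    //   CREATE TABLE task_queue (cas_id VARCHAR(128), queue_name VARCHAR(64),
    //       checked_out TIMESTAMP, done SMALLINT)
    // Put items handed out too long ago, and never returned, back into the
    // queue by clearing their check-out timestamp.
    public static void rescheduleStale(Connection con, long maxAgeMillis)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "UPDATE task_queue SET checked_out = NULL "
            + "WHERE done = 0 AND checked_out < ?");
        ps.setTimestamp(1,
            new Timestamp(System.currentTimeMillis() - maxAgeMillis));
        ps.executeUpdate();
    }
}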

I think it would be good to create three jiras:
- one to add search support based on Lucas
- one to use Derby for CAS persistence instead of the simple java.util.Map
  used by the dummy
- and one issue to add support to create a task queue

What do you think?

Jörn

Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
2011/6/29 Jörn Kottmann <ko...@gmail.com>

> On 6/29/11 3:21 PM, Tommaso Teofili wrote:
>
>> if you need it I would be happy to help on such a part.
>>>
>>
> Yes that would be nice.
> Currently we could need help on the Corpus Server or the Wikinews Importer.
>
> Is there anything specific you would like to do?
>

I think I can better help with the Corpus Server, as I've also some
experience with Lucene (by the way, I imagine Lucas could be used to save
CASes inside the index). I think I can help with CAS searching and task
queueing (with a UIMA AAE process).

Tommaso

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/29/11 3:21 PM, Tommaso Teofili wrote:
>> if you need it I would be happy to help on such a part.

Yes, that would be nice.
Currently we could use help on the Corpus Server or the Wikinews Importer.

Is there anything specific you would like to do?

Jörn


Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
2011/6/29 Jörn Kottmann <ko...@gmail.com>

> On 6/29/11 12:46 AM, Jörn Kottmann wrote:
>
>> A work item is just a CAS, and the annotator should enhance it with
>> annotations if possible.
>> The work queue defines what kind of task the CAS belongs to; we might
>> have one work queue which contains CASes for named entity labeling and
>> another one for POS labeling.
>>
>
> I extended the proposal about this a bit, please have a look there:
>
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
> We could also use such task queues to create an interface to UIMA tooling,
> e.g.
> for training or re-labeling everything in the queue automatically.
>
If you need it I would be happy to help on such a part.
Tommaso


>  Jörn
>
>
>

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/29/11 12:46 AM, Jörn Kottmann wrote:
> A work item is just a CAS, and the annotator should enhance it with
> annotations if possible.
> The work queue defines what kind of task the CAS belongs to; we might
> have one work queue which contains CASes for named entity labeling and
> another one for POS labeling.

I extended the proposal about this a bit, please have a look there:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

We could also use such task queues to create an interface to UIMA tooling,
e.g. for training or re-labeling everything in the queue automatically.

Jörn



Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/29/11 12:37 AM, Hannes Korte wrote:
> On 29.06.2011 00:13, Jörn Kottmann wrote:
>> I thought a bit more about work scheduling and believe it might be
>> nice to put this functionality directly into the Corpus Server. Then
>> a labeling task queue can be defined by a search query to match CASes
>> which should be tagged, and an annotation tool can just ask the
>> Corpus Server for the next work item.
>
> +1, that's what I planned to do as well. And at a later stage we can 
> integrate a real active learning component there. Such a work item 
> consists of a CAS and some sentence identifier?

A work item is just a CAS, and the annotator should enhance it with
annotations if possible.
The work queue defines what kind of task the CAS belongs to; we might have
one work queue which contains CASes for named entity labeling and another
one for POS labeling.

The search query could search for CASes with a certain minimum text length,
a minimum number of pre-detected entities, and no human-labeled entities.
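
Expressed as a Lucene query that might look like this (field names are
invented; the numeric fields would of course have to be indexed
accordingly):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class WorkQueueQuery {

    // CASes with at least minLength chars of text, at least one
    // pre-detected entity, and no human-labeled entities yet.
    public static Query nextWorkItems(int minLength) {
        BooleanQuery q = new BooleanQuery();
        q.add(NumericRangeQuery.newIntRange(
            "textLength", minLength, null, true, true),
            BooleanClause.Occur.MUST);
        q.add(NumericRangeQuery.newIntRange(
            "preDetectedEntities", 1, null, true, true),
            BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("hasHumanLabels", "true")),
            BooleanClause.Occur.MUST_NOT);
        return q;
    }
}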

I will try to extend the proposal tomorrow with a sample type system and a
description of how things could be labeled, based on the discussion we had
here a few days ago.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 29.06.2011 00:13, Jörn Kottmann wrote:
> I thought a bit more about work scheduling and believe it might be
> nice to put this functionality directly into the Corpus Server. Then
> a labeling task queue can be defined by a search query to match CASes
> which should be tagged, and an annotation tool can just ask the
> Corpus Server for the next work item.

+1, that's what I planned to do as well. And at a later stage we can 
integrate a real active learning component there. Such a work item 
consists of a CAS and some sentence identifier?

Hannes

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/28/11 11:57 PM, Hannes Korte wrote:
> Yes, I use the coffee-maven-plugin for the compile step. As soon as 
> there is something to see I will commit the first prototype. 

Nice, we could start with something very simple, and just extend it by a 
series of patches. An initial dummy
could search for CASes in the corpus server dummy, get one and maybe 
just display the text.
This way everyone has a chance to see what is done and can comment or 
help out.

It would be nice if you could create a jira which gets us started with the
Corpus Refiner.

I thought a bit more about work scheduling and believe it might be nice
to put this functionality directly into the Corpus Server. Then a 
labeling task queue can be
defined by a search query to match CASes which should be tagged, and an
annotation tool can just ask the Corpus Server for the next work item.
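
Just to illustrate, on the server side such a queue could boil down to
something like this (all names here are hypothetical):

    // A task queue is nothing more than a named search query over the
    // corpus; "next" hands out the best match that is not checked out yet.
    public interface TaskQueue {

        // The search query that defines which CASes belong to this task,
        // e.g. named entity labeling vs. POS labeling.
        String getQuery();

        // Id of the next CAS the given annotator should work on,
        // or null if the queue is drained.
        String nextWorkItem(String annotatorId);
    }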

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 28.06.2011 23:25, Jörn Kottmann wrote:
> On 6/28/11 10:38 PM, Hannes Korte wrote:
>> I think we don't need the Corporate CLA for the Walter stuff. I won't
>> use the Walter JavaScript code for our GUI, because I started
>> implementing our GUI in CoffeeScript instead of JavaScript. I decided
>> to start from scratch, because our GUI differs in many aspects from
>> the current Walter.
>> ...
>
> Can we integrate the CoffeeScript to JavaScript compile step into our
> maven based build?

Yes, I use the coffee-maven-plugin for the compile step. As soon as 
there is something to see I will commit the first prototype.
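
The relevant part of the pom.xml looks roughly like this (plugin
coordinates from memory, please double check the current version):

    <plugin>
      <groupId>com.theoryinpractise</groupId>
      <artifactId>coffee-maven-plugin</artifactId>
      <!-- version omitted here, use the current release -->
      <executions>
        <execution>
          <id>compile-coffeescript</id>
          <goals>
            <goal>coffee</goal>
          </goals>
        </execution>
      </executions>
    </plugin>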

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/28/11 10:38 PM, Hannes Korte wrote:
> I think we don't need the Corporate CLA for the Walter stuff. I won't 
> use the Walter JavaScript code for our GUI, because I started 
> implementing our GUI in CoffeeScript instead of JavaScript. I decided 
> to start from scratch, because our GUI differs in many aspects from 
> the current Walter:
>
>  - only one sentence at a time
>  - complete keyboard navigation
>  - no relations
>  - token-wise annotation, not character-wise
>  - no file up- and download
>
> But, just to be on the safe side, I already handed the CLAs to my boss
> and she handed them to our lawyers to check.

The Contributor License Agreement is an agreement between you and the
ASF; read it and, if you agree, send
a signed version to the ASF secretary. The CCLA might not be necessary,
but it might be good to have and
needs to be signed by your employer.

Anyway, since you are starting from scratch and with small contributions,
the ASF license flag in jira is enough to start with. This way we can
start accepting patches from you, and that is actually the process to
become a committer here.

Can we integrate the CoffeeScript to JavaScript compile step into our 
maven based build?

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 28.06.2011 01:35, Olivier Grisel wrote:
> 2011/6/27 Jörn Kottmann<ko...@gmail.com>:
>> For Walter
>> we might have to
>> do a little IP clearance.
>
> Yes, Hannes you should start doing the paperwork ASAP: see
> http://www.apache.org/licenses/#clas (if you're not yet an Apache
> Committer with a corporate license agreement from your employer).

I think we don't need the Corporate CLA for the Walter stuff. I won't 
use the Walter JavaScript code for our GUI, because I started 
implementing our GUI in CoffeeScript instead of JavaScript. I decided to 
start from scratch, because our GUI differs in many aspects from the 
current Walter:

  - only one sentence at a time
  - complete keyboard navigation
  - no relations
  - token-wise annotation, not character-wise
  - no file up- and download

But, just to be on the safe side, I already handed the CLAs to my boss
and she handed them to our lawyers to check.

Hannes

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/27 Jörn Kottmann <ko...@gmail.com>:
> On 6/25/11 12:16 AM, Hannes Korte wrote:
>>>
>>> Maybe we should go ahead and make a dummy version of this server so
>>> you have something to work with.
>>
>> That would be great. I thought of a simple RESTful service with Jersey,
>> but Thrift looks interesting, too. If you are already familiar with it, I'd
>> like to get introduced to it as the tutorial is rather incomplete.. :)
>> http://thrift.apache.org/tutorial/#t5
>
> OK, I already started a little on the Corpus Server dummy. If there are no
> objections I will put it
> in the sandbox.
>
> Maybe we should make a few jiras to plan and assign the first tasks.
>
> - Get the Corpus Server dummy ready, so it can be used by other parts
> - Create a wikinews importer which can write to the corpus server
> - Add a first version of the Corpus Refiner / Walter to the sandbox
>
> The second tasks could be done easily with code already created by Olivier,
> and the Corpus Refiner could be based on the Walter code created by Hannes.
>
> I will create a jira now for the corpus server, feel free to create jiras
> for the wikinews importer
> and corpus refiner / walter, we can then add that to the sandbox.

+1, thanks for moving forward.

> For Walter
> we might have to
> do a little IP clearance.

Yes, Hannes you should start doing the paperwork ASAP: see
http://www.apache.org/licenses/#clas (if you're not yet an Apache
Committer with a corporate license agreement from your employer).

FYI I have already done this (for the Stanbol project).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/25/11 12:16 AM, Hannes Korte wrote:
>>
>> Maybe we should go ahead and make a dummy version of this server so
>> you have something to work with.
>
> That would be great. I thought of a simple RESTful service with 
> Jersey, but Thrift looks interesting, too. If you are already familiar 
> with it, I'd like to get introduced to it as the tutorial is rather 
> incomplete.. :) http://thrift.apache.org/tutorial/#t5 

OK, I already started a little on the Corpus Server dummy. If there are 
no objections I will put it
in the sandbox.

Maybe we should make a few jiras to plan and assign the first tasks.

- Get the Corpus Server dummy ready, so it can be used by other parts
- Create a wikinews importer which can write to the corpus server
- Add a first version of the Corpus Refiner / Walter to the sandbox

The second tasks could be done easily with code already created by Olivier,
and the Corpus Refiner could be based on the Walter code created by Hannes.

I will create a jira now for the corpus server, feel free to create 
jiras for the wikinews importer
and corpus refiner / walter, we can then add that to the sandbox. For 
Walter me might have to
do a little IP clearance.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 24.06.2011 23:50, Jörn Kottmann wrote:
>> Ok, till then I'll work with some dummy documents. By the way, who wants
>> to build the active learning component, which decides which sentences
>> come next?
>
> I added a description about the initial interface the Corpus Server
> could have to the proposal, please review.
>
> Are there any opinions which technology we should use for remote interface?
>
> For example:
> * RESTful service
> * Web Service
> * Thrift
>
> Maybe we should go ahead and make a dummy version of this server so
> you have something to work with.

That would be great. I thought of a simple RESTful service with Jersey, 
but Thrift looks interesting, too. If you are already familiar with it, 
I'd like to get introduced to it as the tutorial is rather incomplete.. 
:) http://thrift.apache.org/tutorial/#t5

Hannes

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 1:38 PM, Hannes Korte wrote:
>> >  My stuff is not following the new design: in particular it treats
>> >  sentences as individual sentences. Maybe you should go on from the
>> >  existing Walter design that treats CAS as individual, multi-sentences
>> >  documents instead and try to align it with the tooling available in
>> >  opennlp-uima: but we need to agree on a CAS type system first. I don't
>> >  know the opennlp-uima myself and won't have time to invest more effort
>> >  on this project before mid-july unfortunately.
>> >  
> Ok, till then I'll work with some dummy documents. By the way, who wants
> to build the active learning component, which decides which sentences
> come next?

I added a description about the initial interface the Corpus Server 
could have
to the proposal, please review.

Are there any opinions which technology we should use for remote interface?

For example:
* RESTful service
* Web Service
* Thrift

Maybe we should go ahead and make a dummy version of this server so
you have something to work with.
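
To make the RESTful option concrete, the dummy could start with a single
Jersey resource like this (paths and media types are just a strawman, not
a committed design):

    import javax.ws.rs.Consumes;
    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;

    @Path("/corpora/{corpusId}/{casId}")
    public class CasResource {

        // Hand out a CAS as XMI, e.g. to the Corpus Refiner.
        @GET
        @Produces("application/xml")
        public String getCas(@PathParam("corpusId") String corpusId,
                             @PathParam("casId") String casId) {
            return lookupXmi(corpusId, casId);
        }

        // Store an updated CAS sent back by an annotation tool.
        @POST
        @Consumes("application/xml")
        public void putCas(@PathParam("corpusId") String corpusId,
                           @PathParam("casId") String casId,
                           String xmi) {
            storeXmi(corpusId, casId, xmi);
        }

        // Dummy backend, e.g. an in-memory map to start with.
        private String lookupXmi(String corpusId, String casId) { return "<xmi/>"; }
        private void storeXmi(String corpusId, String casId, String xmi) { }
    }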

Jörn


Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
We should have a look at Phrase Detectives as a sample:
http://anawiki.essex.ac.uk/phrasedetectives/

They have a short tutorial/test a user needs to do, they have a
leaderboard, and progress made by the user is tracked.

Jörn

On 6/24/11 1:47 PM, Jörn Kottmann wrote:
> On 6/24/11 1:38 PM, Hannes Korte wrote:
>> On 06/24/2011 11:54 AM, Olivier Grisel wrote:
>>> 2011/6/24 Hannes Korte<ha...@iais.fraunhofer.de>:
>>>> On 24.06.2011 10:03, Jörn Kottmann wrote:
>>>>> Hannes and Olivier, do you want to take over the part about the 
>>>>> web based
>>>>> annotation tooling? I called it for now Corpus Refiner, but we can of
>>>>> course change
>>>>> the name to something else.
>>>> Yes, I'll try to find some time in the next days to have a look at 
>>>> what
>>>> Olivier already committed and to work on the javascript part of the 
>>>> webGUI.
>>> My stuff is not following the new design: in particular it treats
>>> sentences as individual sentences. Maybe you should go on from the
>>> existing Walter design that treats CAS as individual, multi-sentences
>>> documents instead and try to align it with the tooling available in
>>> opennlp-uima: but we need to agree on a CAS type system first. I don't
>>> know the opennlp-uima myself and won't have time to invest more effort
>>> on this project before mid-july unfortunately.
>>>
>> Ok, till then I'll work with some dummy documents. By the way, who wants
>> to build the active learning component, which decides which sentences
>> come next?
> Would it be possible for you to contribute the Walter code to OpenNLP?
>
> In a previous project I used some kind of filtering to find CASes 
> which should be
> annotated. Maybe that is an approach which could work well for us here 
> too.
> The corpus server will index all CASes with annotations, and then the 
> corpus refiner or walter
> server can query the index to find CASes it should hand out to 
> annotators. Sure this logic
> would be task dependent.
>
> This could also be done in a more controlled way, where we insert 
> annotations into the CAS which
> say that this area should be labeled manually.
>
> Let's update the OpenNLP Annotations proposal a little to describe 
> these things.
>
> Jörn
>


Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 1:38 PM, Hannes Korte wrote:
> On 06/24/2011 11:54 AM, Olivier Grisel wrote:
>> 2011/6/24 Hannes Korte<ha...@iais.fraunhofer.de>:
>>> On 24.06.2011 10:03, Jörn Kottmann wrote:
>>>> Hannes and Olivier, do you want to take over the part about the web based
>>>> annotation tooling? I called it for now Corpus Refiner, but we can of
>>>> course change
>>>> the name to something else.
>>> Yes, I'll try to find some time in the next days to have a look at what
>>> Olivier already committed and to work on the javascript part of the webGUI.
>> My stuff is not following the new design: in particular it treats
>> sentences as individual sentences. Maybe you should go on from the
>> existing Walter design that treats CAS as individual, multi-sentences
>> documents instead and try to align it with the tooling available in
>> opennlp-uima: but we need to agree on a CAS type system first. I don't
>> know the opennlp-uima myself and won't have time to invest more effort
>> on this project before mid-july unfortunately.
>>
> Ok, till then I'll work with some dummy documents. By the way, who wants
> to build the active learning component, which decides which sentences
> come next?
Would it be possible for you to contribute the Walter code to OpenNLP?

In a previous project I used some kind of filtering to find CASes which 
should be
annotated. Maybe that is an approach which could work well for us here too.
The corpus server will index all CASes with annotations, and then the 
corpus refiner or walter
server can query the index to find CASes it should hand out to 
annotators. Sure this logic
would be task dependent.

This could also be done in a more controlled way, where we insert 
annotations into the CAS which
say that this area should be labeled manually.

Let's update the OpenNLP Annotations proposal a little to describe these 
things.

Jörn


Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 06/24/2011 11:54 AM, Olivier Grisel wrote:
> 2011/6/24 Hannes Korte <ha...@iais.fraunhofer.de>:
>> On 24.06.2011 10:03, Jörn Kottmann wrote:
>>>
>>> Hannes and Olivier, do you want to take over the part about the web based
>>> annotation tooling? I called it for now Corpus Refiner, but we can of
>>> course change
>>> the name to something else.
>>
>> Yes, I'll try to find some time in the next days to have a look at what
>> Olivier already committed and to work on the javascript part of the webGUI.
> 
> My stuff is not following the new design: in particular it treats
> sentences as individual sentences. Maybe you should go on from the
> existing Walter design that treats CAS as individual, multi-sentences
> documents instead and try to align it with the tooling available in
> opennlp-uima: but we need to agree on a CAS type system first. I don't
> know the opennlp-uima myself and won't have time to invest more effort
> on this project before mid-july unfortunately.
> 
Ok, till then I'll work with some dummy documents. By the way, who wants
to build the active learning component, which decides which sentences
come next?

Hannes

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/24 Hannes Korte <ha...@iais.fraunhofer.de>:
> On 24.06.2011 10:03, Jörn Kottmann wrote:
>>
>> Hannes and Olivier, do you want to take over the part about the web based
>> annotation tooling? I called it for now Corpus Refiner, but we can of
>> course change
>> the name to something else.
>
> Yes, I'll try to find some time in the next days to have a look at what
> Olivier already committed and to work on the javascript part of the webGUI.

My stuff is not following the new design: in particular it treats
sentences as individual sentences. Maybe you should go on from the
existing Walter design that treats CAS as individual, multi-sentences
documents instead and try to align it with the tooling available in
opennlp-uima: but we need to agree on a CAS type system first. I don't
know the opennlp-uima myself and won't have time to invest more effort
on this project before mid-july unfortunately.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Hannes Korte <ha...@iais.fraunhofer.de>.
On 24.06.2011 10:03, Jörn Kottmann wrote:
> Hannes and Olivier, do you want to take over the part about the web based
> annotation tooling? I called it for now Corpus Refiner, but we can of
> course change
> the name to something else.

Yes, I'll try to find some time in the next days to have a look at what 
Olivier already committed and to work on the javascript part of the webGUI.

Hannes

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

let's update the proposal a little and extend it with the things we
discussed here.

Hannes and Olivier, do you want to take over the part about the web based
annotation tooling? I called it for now Corpus Refiner, but we can of 
course change
the name to something else.

You should have editing rights now when you login with a confluence user.

Here is the link again:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Jörn

> 2011/6/24 James Kosin<ja...@gmail.com>:
>> Olivier,
>>
>> No main() in the classes.  So, how does one get the collection of
>> articles started?
> It's meant to be used as a library. For instance, it is used by the
> following custom pig Loader:
>
>    https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/ParsingWikipediaLoader.java
>
> which is in turn called in pig scripts such as:
>
>   https://github.com/ogrisel/pignlproc/blob/master/examples/extract_links.pig
>
> Apache Pig is a scripting language and runtime environment to perform
> distributed data analysis on an Apache Hadoop (HDFS + MapReduce)
> cluster.
>
>    http://pig.apache.org/
>


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/7/4 Jörn Kottmann <ko...@gmail.com>:
> On 7/4/11 7:41 PM, Olivier Grisel wrote:
>>
>> 2011/7/4 Jörn Kottmann<ko...@gmail.com>:
>>>
>>> On 7/4/11 7:20 PM, Olivier Grisel wrote:
>>>>
>>>> Keeping the correct link position from the original markup while
>>>> cleaning it can be tricky though. Be careful when tweaking the parser.
>>>> Maybe the Span helper classes from OpenNLP could help make this code
>>>> more robust.
>>>
>>> I wonder how important the links are here, because we do not want to
>>> throw
>>> away sentences which do not have links covering their entities.
>>>
>>> But I believe the links might be very interesting for entity
>>> identification,
>>> if, let's say, a person name is labeled and also covered by a link. The
>>> link
>>> can be used to identify the person mention.
>>
>> Yes this is exactly what pignlproc is doing. Building a NameFinder
>> training corpus automatically from the link position info from the
>> wikipedia articles and the entity typing info from the DBpedia dumps
>> (this article is about a person, this one about an organization...).
>>
>>> And after we have a few manually labeled articles we can use the links to
>>> generate special features which are passed to the name finder.
>>>
>>> So in the end, do we just generate an annotation for every link?!
>>
>> This is very important to build a preannotated corpus to bootstrap and
>> train a first version of OpenNLP models automatically. This model can
>> then be used to annotate new text without any annotations and human
>> refinement can then be used to produce gold annotations rapidly by
>> mostly validating / fixing existing annotations rather than annotating
>> text from scratch.
>>
>> Links can also be useful to train a NE disambiguation training corpus.
>>
>
> The automatic labeling can be supported by features generated from the link
> annotations; this way I guess the name finder will perform much better, but
> evaluation will show that.

Yes, but such features are useless for tagging other named entity
occurrences for which we don't have any kind of link data.
If the data is linked to a wikipedia page, you can just do a SPARQL
query on DBpedia to know the type, no need for any kind of NLP (or
better, build a local index of entity types using Solr based on the
DBpedia dump).
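
For illustration, such a lookup is only a few lines with Jena (the
resource URI is just an example):

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;

    // Ask the public DBpedia endpoint for the rdf:type values of a resource.
    static void printTypes(String resourceUri) {
        String query = "SELECT ?type WHERE { <" + resourceUri + "> "
                + "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }";
        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://dbpedia.org/sparql", query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("type").getURI());
            }
        } finally {
            qe.close();
        }
    }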

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: AnnotatingMarkupParser questions (WAS: OpenNLP Annotations Proposal)

Posted by Olivier Grisel <ol...@ensta.org>.
2011/7/4 Jörn Kottmann <ko...@gmail.com>:
> Do you know how to filter tags like this one: {{date|November 13, 2004}} ?
>
> The current implementation just turns it into {{date}}, but they must
> either be replaced by the date or just removed.

Yes, you should try to add support for the "date" template: come up with
a handler that can extract the interesting part "November 13, 2004"
and put it into the text buffer. I knew about this bug but did not find
the time to investigate, sorry.
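
Until such a handler exists, a crude workaround could be to inline the
template's first argument before parsing, e.g.:

    import java.util.regex.Pattern;

    // Crude stopgap: turn {{date|November 13, 2004}} into "November 13, 2004"
    // before the markup is handed to the parser. A real fix would register a
    // template handler with the gwtwiki model instead.
    static String inlineDateTemplates(String markup) {
        Pattern dateTemplate = Pattern.compile(
                "\\{\\{date\\|([^}]*)\\}\\}", Pattern.CASE_INSENSITIVE);
        return dateTemplate.matcher(markup).replaceAll("$1");
    }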

> I have the same issue for byline, etc.
>
> The wikinews dump also contains pages which are not news articles,
> I filter them now based on the availability of the {{publish}} tag, and then
> cut the article after the text ends, based on the availability of headings,
> or tags.
>
> I changed the link handling a bit, because many links seem to be inter-wiki
> links
> which the current implementation filters out.

Indeed. We might want to make this behavior controllable through
constructor arguments.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

AnnotatingMarkupParser questions (WAS: OpenNLP Annotations Proposal)

Posted by Jörn Kottmann <ko...@gmail.com>.
Do you know how to filter tags like this one: {{date|November 13, 2004}} ?

The current implementation just turns it into {{date}}, but they must
either be replaced by the date or just removed.

I have the same issue for byline, etc.

The wikinews dump also contains pages which are not news articles;
I filter them now based on the presence of the {{publish}} tag, and then
cut the article after the text ends, based on the presence of headings
or tags.

I changed the link handling a bit, because many links seem to be 
inter-wiki links
which the current implementation filters out.

Jörn

On 7/4/11 7:57 PM, Jörn Kottmann wrote:
> On 7/4/11 7:41 PM, Olivier Grisel wrote:
>> 2011/7/4 Jörn Kottmann<ko...@gmail.com>:
>>> On 7/4/11 7:20 PM, Olivier Grisel wrote:
>>>> Keeping the correct link position from the original markup while
>>>> cleaning it can be tricky though. Be careful when tweaking the parser.
>>>> Maybe the Span helper classes from OpenNLP could help make this code
>>>> more robust.
>>> I wonder how important the links are here, because we do not want to 
>>> throw
>>> away sentences which do not have links covering their entities.
>>>
>>> But I believe the links might be very interesting for entity 
>>> identification,
>>> if, let's say, a person name is labeled and also covered by a link. 
>>> The link
>>> can be used to identify the person mention.
>> Yes this is exactly what pignlproc is doing. Building a NameFinder
>> training corpus automatically from the link position info from the
>> wikipedia articles and the entity typing info from the DBpedia dumps
>> (this article is about a person, this one about an organization...).
>>
>>> And after we have a few manually labeled articles we can use the 
>>> links to
>>> generate special features which are passed to the name finder.
>>>
>>> So in the end, do we just generate an annotation for every link?!
>> This is very important to build a preannotated corpus to bootstrap and
>> train a first version of OpenNLP models automatically. This model can
>> then be used to annotate new text without any annotations and human
>> refinement can then be used to produce gold annotations rapidly by
>> mostly validating / fixing existing annotations rather than annotating
>> text from scratch.
>>
>> Links can also be useful to train a NE disambiguation training corpus.
>>
>
> The automatic labeling can be supported by features generated from the link
> annotations; this way I guess the name finder will perform much better, but
> evaluation will show that.
>
> Jörn
>


Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 7/4/11 7:41 PM, Olivier Grisel wrote:
> 2011/7/4 Jörn Kottmann<ko...@gmail.com>:
>> On 7/4/11 7:20 PM, Olivier Grisel wrote:
>>> Keeping the correct link position from the original markup while
>>> cleaning it can be tricky though. Be careful when tweaking the parser.
>>> Maybe the Span helper classes from OpenNLP could help make this code
>>> more robust.
>> I wonder how important the links are here, because we do not want to throw
>> away sentences which do not have links covering their entities.
>>
>> But I believe the links might be very interesting for entity identification,
>> if, let's say, a person name is labeled and also covered by a link. The link
>> can be used to identify the person mention.
> Yes this is exactly what pignlproc is doing. Building a NameFinder
> training corpus automatically from the link position info from the
> wikipedia articles and the entity typing info from the DBpedia dumps
> (this article is about a person, this one about an organization...).
>
>> And after we have a few manually labeled articles we can use the links to
>> generate special features which are passed to the name finder.
>>
>> So in the end, do we just generate an annotation for every link?!
> This is very important to build a preannotated corpus to bootstrap and
> train a first version of OpenNLP models automatically. This model can
> then be used to annotate new text without any annotations and human
> refinement can then be used to produce gold annotations rapidly by
> mostly validating / fixing existing annotations rather than annotating
> text from scratch.
>
> Links can also be useful to train a NE disambiguation training corpus.
>

The automatic labeling can be supported by features generated from the link
annotations; this way I guess the name finder will perform much better, but
evaluation will show that.

Jörn


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/7/4 Jörn Kottmann <ko...@gmail.com>:
> On 7/4/11 7:20 PM, Olivier Grisel wrote:
>>
>> Keeping the correct link position from the original markup while
>> cleaning it can be tricky though. Be careful when tweaking the parser.
>> Maybe the Span helper classes from OpenNLP could help make this code
>> more robust.
>
> I wonder how important the links are here, because we do not want to throw
> away sentences which do not have links covering their entities.
>
> But I believe the links might be very interesting for entity identification,
> if, let's say, a person name is labeled and also covered by a link. The link
> can be used to identify the person mention.

Yes this is exactly what pignlproc is doing. Building a NameFinder
training corpus automatically from the link position info from the
wikipedia articles and the entity typing info from the DBpedia dumps
(this article is about a person, this one about an organization...).
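
For illustration, one link occurrence then becomes a pre-annotated sample
roughly like this (token indices and the "person" type are made up; in
practice they come from the link offsets and the DBpedia lookup):

    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.util.Span;

    // The link [[Barack Obama]] covers tokens 0..2 and DBpedia says the
    // target page is a person, so we emit a typed name span.
    String[] tokens = { "Barack", "Obama", "visited", "Berlin", "." };
    Span[] names = { new Span(0, 2, "person") };
    NameSample sample = new NameSample(tokens, names, true); // true = clear adaptive data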

> And after we have a few manually labeled articles we can use the links to
> generate special features which are passed to the name finder.
>
> So in the end, do we just generate an annotation for every link?!

This is very important to build a preannotated corpus to bootstrap and
train a first version of OpenNLP models automatically. This model can
then be used to annotate new text without any annotations and human
refinement can then be used to produce gold annotations rapidly by
mostly validating / fixing existing annotations rather than annotating
text from scratch.

Links can also be useful to train a NE disambiguation training corpus.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 7/4/11 7:20 PM, Olivier Grisel wrote:
> Keeping the correct link position from the original markup while
> cleaning it can be tricky though. Be careful when tweaking the parser.
> Maybe the Span helper classes from OpenNLP could help make this code
> more robust.

I wonder how important the links are here, because we do not want to throw
away sentences which do not have links covering their entities.

But I believe the links might be very interesting for entity identification,
if, let's say, a person name is labeled and also covered by a link. The link
can be used to identify the person mention.

And after we have a few manually labeled articles we can use the links to
generate special features which are passed to the name finder.

So in the end, do we just generate an annotation for every link?!

Jörn


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/7/4 Jörn Kottmann <ko...@gmail.com>:
> On 7/4/11 2:05 PM, Olivier Grisel wrote:
>>
>> Done. See my comment on
>> https://issues.apache.org/jira/browse/OPENNLP-211  for additional info
>> on the integration / usage.
>
> Thanks, it doesn't seem that difficult to parse. Hopefully we will quickly
> reach a state where it is possible to import the wikinews data into the
> corpus server; the parsing might need a little fine tuning to give good
> results.

Keeping the correct link position from the original markup while
cleaning it can be tricky though. Be careful when tweaking the parser.
Maybe the Span helper classes from OpenNLP could help make this code
more robust.
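
A sketch of what I mean, assuming the link positions are kept as OpenNLP
Spans while chunks of markup are cut out of the text:

    import opennlp.tools.util.Span;

    // When a markup chunk of cutLength characters is removed at cutStart,
    // shift any link span that starts after the cut so its offsets still
    // point at the same characters in the cleaned text. Spans overlapping
    // the cut need extra care and are left alone here.
    static Span shiftAfterCut(Span link, int cutStart, int cutLength) {
        if (link.getStart() >= cutStart + cutLength) {
            return new Span(link.getStart() - cutLength,
                    link.getEnd() - cutLength, link.getType());
        }
        return link;
    }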

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 7/4/11 2:05 PM, Olivier Grisel wrote:
> Done. See my comment on
> https://issues.apache.org/jira/browse/OPENNLP-211  for additional info
> on the integration / usage.

Thanks, it doesn't seem that difficult to parse. Hopefully we will quickly
reach a state where it is possible to import the wikinews data into the
corpus server; the parsing might need a little fine tuning to give good
results.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/7/4 Jörn Kottmann <ko...@gmail.com>:
> On 6/24/11 1:10 AM, Olivier Grisel wrote:
>>
>> It's meant to be used as a library. For instance, it is used by the
>> following custom pig Loader:
>>
>>
>> https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/ParsingWikipediaLoader.java
>
> Olivier, would you mind donating this class? If so it would be nice if you
> can attach it to
> this jira issue:
> https://issues.apache.org/jira/browse/OPENNLP-211
>
> Then we can work out an implementation based on it.

Done. See my comment on
https://issues.apache.org/jira/browse/OPENNLP-211 for additional info
on the integration / usage.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/24/11 1:10 AM, Olivier Grisel wrote:
> It's meant to be used as a library. For instance, it is used by the
> following custom pig Loader:
>
>    https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/ParsingWikipediaLoader.java

Olivier, would you mind donating this class? If so it would be nice if 
you can attach it to
this jira issue:
https://issues.apache.org/jira/browse/OPENNLP-211

Then we can work out an implementation based on it.

Thanks,
Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/24 James Kosin <ja...@gmail.com>:
> Olivier,
>
> No main() in the classes.  So, how does one get the collection of
> articles started?

It's meant to be used as a library. For instance, it is used by the
following custom pig Loader:

  https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/ParsingWikipediaLoader.java

which is in turn called in pig scripts such as:

 https://github.com/ogrisel/pignlproc/blob/master/examples/extract_links.pig

Apache Pig is a scripting language and runtime environment to perform
distributed data analysis on an Apache Hadoop (HDFS + MapReduce)
cluster.

  http://pig.apache.org/

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by James Kosin <ja...@gmail.com>.
Olivier,

No main() in the classes.  So, how does one get the collection of
articles started?

James

On 6/22/2011 8:26 AM, Olivier Grisel wrote:
> 2011/6/22 Olivier Grisel <ol...@ensta.org>:
>> Sure, it is here (again the ITextConverter API imposed by gwtwiki is
>> not intuitive so focus on the convert / getWikiLinks methods as entry
>> points when reading the source code):
>>
>>  https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
> Actually the "convert" method has been renamved to "parse". I need to
> update the javadoc.
>
>


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Olivier Grisel <ol...@ensta.org>:
>
> Sure, it is here (again the ITextConverter API imposed by gwtwiki is
> not intuitive so focus on the convert / getWikiLinks methods as entry
> points when reading the source code):
>
>  https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java

Actually the "convert" method has been renamved to "parse". I need to
update the javadoc.


-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Jörn Kottmann <ko...@gmail.com>:
> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>
>> I will (soon?) include a couple of new scripts in pignlproc to extract
>> occurrence contexts of any kind of entities occurring as wikilinks in
>> Wikipedia dumps to load those in a Solr index. I will let you know
>> when that happens.
>
> We definitely need some code to parse the wikipedia articles.
> How do you transform the wiki text to plain text in pignlproc?

I use a mediawiki markup parser from gwtwiki: https://code.google.com/p/gwtwiki/

The API is not intuitive to use, but when I searched for a good
mediawiki parser it was one of the best I found that had a license
compatible with ASF requirements for dependencies.

> Could we take a similar approach for the annotation project, or maybe
> even share the code which does it?

Sure, it is here (again the ITextConverter API imposed by gwtwiki is
not intuitive so focus on the convert / getWikiLinks methods as entry
points when reading the source code):

  https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/markup/AnnotatingMarkupParser.java
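
A rough usage sketch (entry point names as above; the exact signatures
and the returned annotation type should be double checked against the
sources):

    // Rough sketch; constructor argument and return types are assumptions.
    pignlproc.markup.AnnotatingMarkupParser parser =
            new pignlproc.markup.AnnotatingMarkupParser("en");
    String text = parser.parse(wikiMarkup);
    // getWikiLinks() should return the link annotations carrying character
    // offsets into the cleaned text plus the target page of each link.
    for (Object link : parser.getWikiLinks()) {
        // map each link to a text span / target page here
    }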

I found empirically that it's able to process about 1MB/s, hence it
roughly requires 1 day to process an English Wikipedia dump. Hence the
use of Apache Pig / Hadoop and EC2 for this kind of task: 20 machines =>
a bit more than 1h to process the same dump in parallel with the same
pig script.

As said previously, I find Spark very promising; it might be more
maintainable than Pig as an integration target, as it is also more
suitable for interactive and iterative tasks, as is the case with NLP /
machine learning stuff.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 10:45 AM, Olivier Grisel wrote:
> I will (soon?) include a couple of new scripts in pignlproc to extract
> occurrence contexts of any kind of entities occurring as wikilinks in
> Wikipedia dumps to load those in a Solr index. I will let you know
> when that happens.

We definitely need some code to parse the wikipedia articles.
How do you transform the wiki text to plain text in pignlproc?

Could we take a similar approach for the annotation project, or maybe
even share the code which does it?

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/22 Jörn Kottmann <ko...@gmail.com>:

> I was actually thinking about something similar. Make a small server which
> can host XMI CAS files. CASes have the advantage that they take away lots
> of the complexity of dealing with a text and its annotations.
>
> Since we have a UIMA integration, OpenNLP can be trained directly with the
> CASes; in this case we would make a small server component which can do
> the training and then make the models available via http, for example.
>
> It sounds like a corpus refiner based web UI could easily be attached
> to such a server, and also other tools like the Cas Editor.

I find the UIMA CAS API much more complicated to work with than
directly working with token-level concepts in the OpenNLP API (i.e.
with arrays of Span). I haven't had a look at the opennlp-uima
subproject though: you probably already have tooling and predefined
type systems that make interoperability with CAS instances less of a
pain.

> To pre-annotate the articles, we might want to add different types of name
> annotations.
>
>> We would like to make a fast binary interface with keyboard shortcuts
>> to focus one sentence at a time. If the user thinks that all the
>> entities in the sentence are correctly annotated by the model, he/she
>> presses "space" and the sentence is marked as validated and the focus moves
>> to the next sentence. If the sentence is complete gibberish he/she can
>> discard the sample by pressing "d". The user can also fix individual
>> annotations using the mouse interface before validating the corrected
>> sample.
>>
> Did you discuss focusing on the sentence level? This solution would still
> require that one annotator goes through the entire document. Maybe we have
> a user who wants to fix our wikinews model to detect his entity of choice.
> Then he might want to search for sentences which contain it and only label
> these.

Adding a keyword filter / search would be very interesting indeed.

> Working on a sentence level also has the advantage that a user can skip
> a sentence which contains an entity he is not sure how to label.

Yes.

> Did you think of using GWT? It might be a very good fit for OpenNLP
> because everyone here has a lot of experience with Java, but maybe not so
> much experience with JS?

In my experience the GWT abstraction layer adds more complexity than
anything else when dealing with low-level DOM-related concepts such as
introducing new "span" elements around a mouse selection.

I much prefer debugging in JS using libraries such as JQuery and the
firebug debugger even though I am not an experienced JS programmer as
well.

Furthermore Hannes already had a working code base.

> Entity disambiguation would be very nice to have in OpenNLP and I also
> need to work on that soon.

I will (soon?) include a couple of new scripts in pignlproc to extract
occurrence contexts of any kind of entities occurring as wikilinks in
Wikipedia dumps and load those into a Solr index. I will let you know
when that happens.
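
The loading side should be simple with SolrJ, roughly like this (the
field names are hypothetical, not the final schema):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Push one entity occurrence context into a local Solr index.
    static void indexContext(String entityUri, String context) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("entity", entityUri);
        doc.addField("context", context);
        solr.add(doc);
        solr.commit();
    }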

>> Comments and pull-requests on the corpus-refiner prototype welcome. I
>> plan to go on working on this project from time to time. AFAIK Hannes
>> won't have time to work on the JS layer in the short term but it
>> should be at least possible to have a first version of the command
>> line based interface rather quickly.
>
> Yes, it would be nice to have such a tool, but for OpenNLP Annotations it
> must be more focused on crowdsourcing and on working well with a small /
> medium-sized group of people.

I agree. The CLI (& Swing) interface is still useful to validate the
workflow concepts though.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/10/11 4:12 PM, Olivier Grisel wrote:
> Hi all,
>
> Here is a short report of the Berlin Buzzwords Semantic / NLP
> Hackathon that happened on Wednesday and yesterday at Neofonie and was
> related to this corpus annotation project.
>
> Basically we worked in small 2-3 people groups on various related topics.
>
> Hannes introduced an HTML / JS based tool named Walter to visualize and
> edit named entities and (optionally typed relations between those
> entities). Demo is here:
>
>    http://tmdemo.iais.fraunhofer.de/walter/
>
> Currently Walter works with UIMA / XMI formatted files as input /
> output using a java servlet deployed on a tomcat server for instance.
> The plan is to adapt it to a corpus annotation validation / refinement
> pattern: feed it with a partially annotated corpus coming from the
> output of an OpenNLP model pre-trained on the annotations extracted from
> Wikipedia using https://github.com/ogrisel/pignlproc to bootstrap
> multilingual models.
>
I was actually thinking about something similar. Make a small server which
can host XMI CAS files. CASes have the advantage that they take away lots
of the complexity of dealing with a text and its annotations.

Since we have a UIMA integration, OpenNLP can be trained directly with the
CASes; in this case we would make a small server component which can do
the training and then make the models available via http, for example.

It sounds like a corpus refiner based web UI could easily be attached
to such a server, and also other tools like the Cas Editor.
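
The storage part is mostly free when using XMI; a sketch of the write
side:

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasSerializer;

    // Write a CAS as an XMI file; this is the representation the server
    // would store and hand out to the annotation front ends. The broad
    // throws clause is because serialize throws SAXException.
    static void save(CAS cas, String fileName) throws Exception {
        OutputStream out = new FileOutputStream(fileName);
        try {
            XmiCasSerializer.serialize(cas, out);
        } finally {
            out.close();
        }
    }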

To pre-annotate the articles, we might want to add different types of
name annotations.

> We would like to make a fast binary interface with keyboard shortcuts
> to focus one sentence at a time. If the user thinks that all the
> entities in the sentence are correctly annotated by the model, he/she
> presses "space" and the sentence is marked as validated and the focus moves
> to the next sentence. If the sentence is complete gibberish he/she can
> discard the sample by pressing "d". The user can also fix individual
> annotations using the mouse interface before validating the corrected
> sample.
>
Did you discuss focusing on the sentence level? This solution would still
require that one annotator goes through the entire document. Maybe we have
a user who wants to fix our wikinews model to detect his entity of choice.
Then he might want to search for sentences which contain it and only label
these.

Working on a sentence level also has the advantage that a user can skip
a sentence which contains an entity he is not sure how to label.

> Up arrow and down arrow allow the user to move the focus to the previous
> and next sentences (infinite AJAX / JSON scrolling over the corpus)
> without validating / discarding the corpus.
>
> When the focus is on a sample, the previous and next samples should be
> displayed before and after with a lower opacity level in read-only
> mode so as to provide the user with contextual information to make the
> right decision on the active sample.
>
> At the end of the session, the user can export all the validated
> samples as a new corpus formatted using the OpenNLP format.
> Unprocessed or explicitly discarded samples are not part of this
> refined version of the annotated corpus.
>
> To implement this we plan to rewrite the server side part of Walter in
> two parts:
>
> 1- a set of JAX-RS resources to convert corpus items + their
> annotations JSON objects on the client to / from OpenNLP NameSamples
> on the server. The first embryo of this part is here:
>
>    https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web
>
> 2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
> validation (with validation / discarding / update + previous and next
> navigation) and serialization of the validated samples to a new
> OpenNLP formatted file that can be fed to train a new generation of
> the model. The work on this part has started here:
>
>    https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner
>
Did you think of using GWT? It might be a very good fit for OpenNLP
because everyone here has a lot of experience with Java, but maybe not
so much experience with JS?

> Have a look at the test folder to see what's currently implemented. I
> would like to keep this in a separate maven artifact to be able to
> build a simple alternative CLI variant of the refiner interface that
> does not require starting a jetty or tomcat instance / browser.
>
> For the client side, Hannes started to check that jquery should make
> it easier to implement the ajax callbacks  based on mouse + keyboard
> interaction.
>
> As for the licensing, Hannes told me that his employer should be
> willing to license the relevant parts of Walter (non-specific to
> Fraunhofer) under a liberal license (MIT, BSD or ASL) so that it should be
> possible to contribute it to the ASF in the long term.
>
> Another group tested DUALIST: the tool looks really nice for the text
> classification case, less so for the NE detection case (the sample
> view is not very well suited for structured output and it requires
> building Hearst features by hand; dualist does not do it automatically
> apparently).
>
> It should be possible to turn the Walter refiner into a real active
> learning annotation tool for structured output (NE and relation extraction)
> if we use the confidence level of the SequentialPerceptron of OpenNLP
> and use the least confident predictions as priority samples for the
> ordering of the samples to process. The server could incrementally use
> the samples refined after pressing "space" or "d" to update its model
> and adjust the priority of the next batch of samples to refine from
> time to time, as the perceptron algorithm is online (it supports partial
> updates of the model without restarting from scratch).
>
> Another group worked on named entity disambiguation using Solr
> MoreLikeThisHandler and indexes of context occurrences of those
> entities occurring in Wikipedia articles. This work will probably be
> integrated in Stanbol directly and should be less interesting for the
> OpenNLP project. Also another group worked on adapting pignlproc to
> their own tools and hadoop infrastructure.
Entity disambiguation would be very nice to have in OpenNLP and I also
need to work on that soon.
> Comments and pull-requests on the corpus-refiner prototype welcome. I
> plan to go on working on this project from time to time. AFAIK Hannes
> won't have time to work on the JS layer in the short term but it
> should be at least possible to have a first version of the command
> line based interface rather quickly.

Yes, it would be nice to have such a tool, but for OpenNLP Annotations it
must be more focused on crowdsourcing and on working well with a small /
medium-sized group of people.

And we of course need to extend it over time to support other kinds of
annotation tasks.

Jörn


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
Hi all,

FYI, I have implemented a first version of the corpus refiner CLI using jline:

  https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner/

It has bugs if you try to edit a sample that is longer than a terminal
line though. Hence I am thinking about writing an alternative Swing
version.

Have a nice WE.

-- 
Olivier

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/22/11 9:43 AM, Jörn Kottmann wrote:
> There is no reason to restrict the access to this proposal. So I will
> try to grant edit permissions to everyone. 

I gave permission to edit/add pages to every confluence user.

Jörn


Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/11/11 12:19 PM, Olivier Grisel wrote:
> 2011/6/11 Jason Baldridge<ja...@gmail.com>:
>>> As mentioned by Tommaso I think we should start to structure the wiki
>>> for this effort. Do you want me to create sub-pages of [1] for
>>> POS-tagging and NE detection? I could write the NE detection page
>>> with a description of the current effort on corpus-refiner / Walter
>>> and let you add pointers for the POS tags case.
>>>
>>> [1] https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>>>
>> Yep, that sounds great. I might not be able to get to it right away, but can
>> put it on my stack!
> Actually, my wiki account "ogrisel" does not have the permission to
> edit or create new pages.
>
There is no reason to restrict the access to this proposal. So I will
try to grant edit permissions to everyone.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/11 Jason Baldridge <ja...@gmail.com>:
>
>> As mentioned by Tommaso I think we should start to structure the wiki
>> for this effort. Do you want me to create sub-pages of [1] for
>> POS-tagging and NE detection? I could write the NE detection page
>> with a description of the current effort on corpus-refiner / Walter
>> and let you add pointers for the POS tags case.
>>
>> [1] https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>>
>
> Yep, that sounds great. I might not be able to get to it right away, but can
> put it on my stack!

Actually, my wiki account "ogrisel" does not have the permission to
edit or create new pages.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jason Baldridge <ja...@gmail.com>.
On Fri, Jun 10, 2011 at 10:29 AM, Olivier Grisel
<ol...@ensta.org>wrote:

>
> No idea. I think Jacob Perkins (and possibly others) who works with
> NLTK was also interested in such open corpora. See for instance this
> thread on metaoptimize.com/qa:
>
>
> http://metaoptimize.com/qa/questions/4650/what-licenses-cover-a-nltk-tagger-trained-on-treebank
>
>
Great. I think a lot of people would benefit from a standard infrastructure
for annotation and training of models for different languages.


> > BTW, there is a lot that can be done to bootstrap POS-taggers from raw
> data
> > and the tags in Wiktionary, so if folks are interested in that I'm happy
> to
> > provide pointers.
>
> As mentioned by Tommaso I think we should start to structure the wiki
> for this effort. Do you want me to create sub-pages of [1] for
> POS-tagging and NE detection? I could write the NE detection page
> with a description of the current effort on corpus-refiner / Walter
> and let you add pointers for the POS tags case.
>
> [1] https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
>
Yep, that sounds great. I might not be able to get to it right away, but can
put it on my stack!

Jason

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
2011/6/10 Jason Baldridge <ja...@gmail.com>:
> This looks great! I don't have time to look at this in great detail right
> now, but am happy to give feedback on particular issues and questions.
>
> Active learning would be nice to add eventually, but it has to be done with
> great care, e.g. using uncertainty alone doesn't really work that well and
> care needs to be taken with class imbalance etc. Random sampling is a good
> starting point, and can be used while ironing out the details.

Acknowledged. I wasn't planning to implement this part myself anyway.

> I can't remember if this has been discussed before, but does there need to
> be a non-OpenNLP group which has a primary purpose of creating open
> standardized datasets and annotation interfaces, etc?
>
> It seems also we might be able to get some corporate sponsorship for
> annotation, improvements to models, creation of resources for specific
> languages, etc.

No idea. I think Jacob Perkins (and possibly others) who works with
NLTK was also interested in such open corpora. See for instance this
thread on metaoptimize.com/qa:

  http://metaoptimize.com/qa/questions/4650/what-licenses-cover-a-nltk-tagger-trained-on-treebank

> BTW, there is a lot that can be done to bootstrap POS-taggers from raw data
> and the tags in Wiktionary, so if folks are interested in that I'm happy to
> provide pointers.

As mentioned by Tommaso I think we should start to structure the wiki
for this effort. Do you want me to create sub-pages of [1] for
POS-tagging and NE detection? I could write the NE detection page
with a description of the current effort on corpus-refiner / Walter
and let you add pointers for the POS tags case.

[1] https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jason Baldridge <ja...@gmail.com>.
This looks great! I don't have time to look at this in great detail right
now, but am happy to give feedback on particular issues and questions.

Active learning would be nice to add eventually, but it has to be done with
great care, e.g. using uncertainty alone doesn't really work that well and
care needs to be taken with class imbalance etc. Random sampling is a good
starting point, and can be used while ironing out the details.
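
Just to make the naive baseline concrete, the usual uncertainty score
would look something like this with the OpenNLP name finder (a sketch of
the idea only, with all the caveats above):

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    // Naive uncertainty score for one sentence: the confidence of the
    // least confident predicted name. Low scores would be labeled first.
    // As noted above, uncertainty alone is not enough in practice.
    static double minConfidence(NameFinderME finder, String[] tokens) {
        Span[] names = finder.find(tokens);
        double min = 1.0;
        for (double p : finder.probs(names)) {
            min = Math.min(min, p);
        }
        return min;
    }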

I can't remember if this has been discussed before, but does there need to
be a non-OpenNLP group which has a primary purpose of creating open
standardized datasets and annotation interfaces, etc?

It seems also we might be able to get some corporate sponsorship for
annotation, improvements to models, creation of resources for specific
languages, etc.

BTW, there is a lot that can be done to bootstrap POS-taggers from raw data
and the tags in Wiktionary, so if folks are interested in that I'm happy to
provide pointers.

-Jason


On Fri, Jun 10, 2011 at 9:12 AM, Olivier Grisel <ol...@ensta.org>wrote:

> Hi all,
>
> Here is a short report of the Berlin Buzzwords Semantic / NLP
> Hackathon that happened on Wednesday and yesterday at Neofonie and was
> related to this corpus annotation project.
>
> Basically we worked in small 2-3 people groups on various related topics.
>
> Hannes introduced an HTML / JS based tool named Walter to visualize and
> edit named entities and (optionally typed relations between those
> entities). Demo is here:
>
>  http://tmdemo.iais.fraunhofer.de/walter/
>
> Currently Walter works with UIMA / XMI formatted files as input /
> output using a java servlet deployed on a tomcat server for instance.
> The plan is to adapt it to a corpus annotation validation / refinement
> pattern: feed it with a partially annotated corpus coming from the
> output of an OpenNLP model pre-trained on the annotations extracted from
> Wikipedia using https://github.com/ogrisel/pignlproc to bootstrap
> multilingual models.
>
> We would like to make a fast binary interface with keyboard shortcuts
> to focus one sentence at a time. If the user thinks that all the
> entities in the sentence are correctly annotated by the model, he/she
> presses "space" and the sentence is marked as validated and the focus moves
> to the next sentence. If the sentence is complete gibberish he/she can
> discard the sample by pressing "d". The user can also fix individual
> annotations using the mouse interface before validating the corrected
> sample.
>
> Up arrow and down arrow allow the user to move the focus to the previous
> and next sentences (infinite AJAX / JSON scrolling over the corpus)
> without validating / discarding the corpus.
>
> When the focus is on a sample, the previous and next samples should be
> displayed before and after with a lower opacity level in read-only
> mode so as to provide the user with contextual information to make the
> right decision on the active sample.
>
> At the end of the session, the user can export all the validated
> samples as a new corpus formatted using the OpenNLP format.
> Unprocessed or explicitly discarded samples are not part of this
> refined version of the annotated corpus.
>
> To implement this we plan to rewrite the server side part of Walter in
> two parts:
>
> 1- a set of JAX-RS resources to convert corpus items + their
> annotations JSON objects on the client to / from OpenNLP NameSamples
> on the server. The first embryo of this part is here:
>
>
> https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web
>
> 2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
> validation (with validation / discarding / update + previous and next
> navigation) and serialization of the validated samples to a new
> OpenNLP formatted file that can be fed to train a new generation of
> the model. The work on this part has started here:
>
>
> https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner
>
> Have a look at the test folder to see what's currently implemented. I
> would like to keep this in a separate maven artifact to be able to
> build a simple alternative CLI variant of the refiner interface that
> does not require starting a jetty or tomcat instance / browser.
>
> For the client side, Hannes started to check that jquery should make
> it easier to implement the ajax callbacks  based on mouse + keyboard
> interaction.
>
> As for the licensing, Hannes told me that his employer should be
> willing to license the relevant parts (non specific to Fraunhoffer)
> Walter under a liberal license (MIT, BSD or ASL) so that it should be
> possible to contribute it to the ASF in the long term.
>
> Another group tested DUALIST: the tool looks really nice for the text
> classification case, less so for the NE detection case (the sample
> view is not very well suited for structured output and it requires to
> build Hearst features by hand, dualist does not do it automatically
> apparently).
>
> It should be possible to turn the Walter refiner into a real active
> learning annotation for structured output (NE and relation extraction)
> if we use the confidence level of the SequentialPerceptron of OpenNLP
> and use the less confident predictions as priority samples for the
> ordering of the sample to processing using the refined after pressing
> "space" or "d". The server could incrementally used the refined sample
> to update it's model and adjust the priority of the next batch of
> samples to refine from time to time as the perceptron algorithm is
> online (supports partial update of the model without restarting from
> scratch).
>
> Another group worked on named entity disambiguation using Solr
> MoreLikeThisHandler and indexes of context occurrences of those
> entities occurring in Wikipedia article. This work will probably be
> integrated in Stanbol directly and should be less interesting for the
> OpenNLP project. Also another group worked on adapting pignlproc to
> their own tools and hadoop infrastructure.
>
> Comments and pull-requests on the corpus-refiner prototype welcome. I
> plan to go on working on this project from time to time. AFAIK Hannes
> won't have time to work on the JS layer in the short term but it
> should be at least possible to have a first version of the command
> line based interface rather quickly.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>



-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
Nice! It seems to me Walter/corpus-refiner could be useful with regard to
OpenNLP Annotations [1].
Thanks for the report Olivier :)
Tommaso

[1] : https://cwiki.apache.org/OPENNLP/opennlp-annotations.html


Re: OpenNLP Annotations Proposal

Posted by Olivier Grisel <ol...@ensta.org>.
Hi all,

Here is a short report of the Berlin Buzzwords Semantic / NLP
Hackathon that happened on Wednesday and yesterday at Neofonie and was
related to this corpus annotation project.

Basically we worked in small groups of 2-3 people on various related topics.

Hannes introduced an HTML / JS based tool named Walter to visualize and
edit named entities and (optionally) typed relations between those
entities. Demo is here:

  http://tmdemo.iais.fraunhofer.de/walter/

Currently Walter works with UIMA / XMI formatted files as input /
output, using a Java servlet deployed on a Tomcat server for instance.
The plan is to adapt it to a corpus annotation validation / refinement
pattern: feed it with a partially annotated corpus coming from the
output of an OpenNLP model pre-trained on the annotations extracted from
Wikipedia using https://github.com/ogrisel/pignlproc to bootstrap
multilingual models.

We would like to make a fast binary (accept / reject) interface with
keyboard shortcuts to focus one sentence at a time. If the user thinks
that all the entities in the sentence are correctly annotated by the
model, he/she presses "space", the sentence is marked as validated and
the focus moves to the next sentence. If the sentence is complete
gibberish he/she can discard the sample by pressing "d". The user can
also fix individual annotations using the mouse interface before
validating the corrected sample.

Up arrow and down arrow allow the user to move the focus to the
previous and next sentences (infinite AJAX / JSON scrolling over the
corpus) without validating / discarding them.

When the focus is on a sample, the previous and next samples should be
displayed before and after it with a lower opacity level in read-only
mode, so as to provide the user with contextual information to make the
right decision on the active sample.

At the end of the session, the user can export all the validated
samples as a new corpus in the OpenNLP format. Unprocessed or
explicitly discarded samples are not part of this refined version of
the annotated corpus.

To implement this we plan to rewrite the server side part of Walter in
two parts:

1- a set of JAX-RS resources to convert corpus items + their
annotations as JSON objects on the client to / from OpenNLP NameSamples
on the server (see the sketch after this list). The first embryo of
this part is here:

  https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner-web

2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
validation (with validation / discarding / update + previous and next
navigation) and serialization of the validated samples to a new
OpenNLP formatted file that can be fed to train a new generation of
the model. The work on this part has started here:

  https://github.com/ogrisel/bbuzz-semantic-hackathon/tree/master/corpus-refiner
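
To make part 1 a bit more concrete, here is a rough sketch, not the actual
corpus-refiner-web code: the resource path and the JSON shape are invented
for the example, the corpus is assumed to be preloaded and the resource
registered as a singleton, and a JSON provider (e.g. Jackson) is assumed to
be wired into the JAX-RS runtime:

    import java.util.*;
    import javax.ws.rs.*;
    import javax.ws.rs.core.MediaType;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.util.Span;

    // Sketch of a JAX-RS resource exposing one corpus sample at a time
    // as JSON for the JS client.
    @Path("/samples")
    @Produces(MediaType.APPLICATION_JSON)
    public class SampleResource {

        private final List<NameSample> corpus;

        public SampleResource(List<NameSample> corpus) {
            this.corpus = corpus;
        }

        @GET
        @Path("/{index}")
        public Map<String, Object> get(@PathParam("index") int index) {
            NameSample sample = corpus.get(index);
            List<Map<String, Object>> names = new ArrayList<Map<String, Object>>();
            for (Span span : sample.getNames()) {
                Map<String, Object> name = new HashMap<String, Object>();
                name.put("start", span.getStart());   // token offsets, not chars
                name.put("end", span.getEnd());
                name.put("type", span.getType());
                names.add(name);
            }
            Map<String, Object> json = new HashMap<String, Object>();
            json.put("tokens", sample.getSentence());
            json.put("names", names);
            return json;
        }
    }

A matching PUT resource would accept the corrected spans back and update
the sample on the server side.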

Have a look at the test folder of corpus-refiner to see what's
currently implemented. I would like to keep this in a separate Maven
artifact to be able to build a simple alternative CLI variant of the
refiner interface, sketched below, that does not require starting a
Jetty or Tomcat instance / browser.
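
A minimal sketch of what that CLI loop could look like, assuming the 1.5.x
stream API (NameSampleDataStream over PlainTextByLineStream); the file
names are placeholders:

    import java.io.*;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    // Sketch: iterate over an OpenNLP-formatted corpus; Enter validates the
    // current sample, "d" discards it; validated samples are written out.
    public class CliRefiner {
        public static void main(String[] args) throws IOException {
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                    new PlainTextByLineStream(new FileReader("corpus.txt")));
            BufferedWriter out = new BufferedWriter(new FileWriter("validated.txt"));
            BufferedReader console = new BufferedReader(new InputStreamReader(System.in));

            NameSample sample;
            while ((sample = samples.read()) != null) {
                // toString() renders the <START:type> ... <END> training format
                System.out.println(sample.toString());
                String key = console.readLine();
                if (key == null) break;                // end of input
                if (!"d".equals(key.trim())) {         // anything else validates
                    out.write(sample.toString());
                    out.newLine();
                }
            }
            out.close();
            samples.close();
        }
    }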

For the client side, Hannes started to check that jQuery should make
it easier to implement the AJAX callbacks based on mouse + keyboard
interaction.

As for the licensing, Hannes told me that his employer should be
willing to license the relevant parts of Walter (those not specific to
Fraunhofer) under a liberal license (MIT, BSD or ASL) so that it should
be possible to contribute it to the ASF in the long term.

Another group tested DUALIST: the tool looks really nice for the text
classification case, less so for the NE detection case (the sample
view is not very well suited for structured output, and it requires
building Hearst features by hand; DUALIST apparently does not do it
automatically).

It should be possible to turn the Walter refiner into a real active
learning annotation tool for structured output (NE and relation
extraction) if we use the confidence level of the SequentialPerceptron
of OpenNLP and treat the least confident predictions as priority
samples when ordering the samples to process in the refiner. After the
user presses "space" or "d", the server could incrementally use the
refined sample to update its model and, from time to time, adjust the
priority of the next batch of samples to refine, as the perceptron
algorithm is online (it supports partial updates of the model without
restarting from scratch).
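
A sketch of that sample-ordering step, assuming NameFinderME's
probs(Span[]) method (which reports a confidence per predicted span);
everything else here is illustrative:

    import java.util.*;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    // Sketch: order sentences so the least confident predictions come first.
    public class UncertaintyQueue {
        public static List<String[]> prioritize(NameFinderME finder,
                                                List<String[]> sentences) {
            final Map<String[], Double> score = new IdentityHashMap<String[], Double>();
            for (String[] tokens : sentences) {
                Span[] spans = finder.find(tokens);
                double min = 1.0;                  // no prediction = fully confident
                for (double p : finder.probs(spans)) {
                    min = Math.min(min, p);
                }
                score.put(tokens, min);
                finder.clearAdaptiveData();        // sentences are independent here
            }
            List<String[]> ordered = new ArrayList<String[]>(sentences);
            Collections.sort(ordered, new Comparator<String[]>() {
                public int compare(String[] a, String[] b) {
                    return Double.compare(score.get(a), score.get(b));
                }
            });
            return ordered;                        // least confident first
        }
    }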

Another group worked on named entity disambiguation using Solr's
MoreLikeThisHandler and indexes of the contexts in which those
entities occur in Wikipedia articles. This work will probably be
integrated into Stanbol directly and should be less interesting for
the OpenNLP project. Another group worked on adapting pignlproc to
their own tools and Hadoop infrastructure.

Comments and pull requests on the corpus-refiner prototype are
welcome. I plan to keep working on this project from time to time.
AFAIK Hannes won't have time to work on the JS layer in the short term,
but it should at least be possible to have a first version of the
command-line interface rather quickly.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
I will definitely have a look at this tool, thanks for pointing it
out.

For the labeling part I believe we should do both crowdsourcing and
labeling with linguistic experts, even if we use the experts only to
label some test data, which we need in order to measure how well the
crowdsourcing approach works.

Let's try to extend the proposal a little so we have some sort of plan
which could get us started.

Jörn

On 6/8/11 6:36 PM, Jason Baldridge wrote:
> +1 This is awesome.
>
> Here is a tool that could be relevant in getting the ball rolling on some
> datasets:
>
> http://code.google.com/p/dualist/
>
> Jason


Re: OpenNLP Annotations Proposal

Posted by Jason Baldridge <ja...@gmail.com>.
+1 This is awesome.

Here is a tool that could be relevant in getting the ball rolling on some
datasets:

http://code.google.com/p/dualist/

Jason



-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: OpenNLP Annotations Proposal

Posted by Chris Collins <ch...@yahoo.com>.
Thanks Jörn, I agree with your assessment.  This is exactly where I am at the moment, and I am sure many others are too.  You hit the nail on the head: currently people have to start from scratch and that's daunting.  For the phase when you start crowdsourcing I am wondering what this web based UI would look like.  I am assuming that, with some basic instructions, it would cover things like:

- sentence boundary markup
- name identification (people, money, dates, locations, products)

These are narrowly focused, crowd-sourceable tasks with a somewhat trivial UI (for the following sentences, highlight names of people, such as "steve jobs", "prince william").

When it comes to POS tagging (which is my current challenge) you can approach it like the above ("For the following sentences select all the nouns"), then re-assemble all the observations and perhaps use something like triple judgements to look for disagreement; or you could have an editor that lets a user mark up the whole sentence (perhaps we fill in the parts we are already guessing from a pre-learnt model).  Not sure if the triple judgement is necessary; maybe sentences labeled by a collection of people would still converge well in training.
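
The re-assembly step could start out as plain per-token majority voting
over the collected judgements; here is a minimal sketch (the input shape,
one tag sequence per annotator, is an assumption):

    import java.util.*;

    // Sketch: merge several annotators' tag sequences for one sentence by
    // majority vote; a null entry marks a token the annotators disagree on.
    public class JudgementMerger {
        public static String[] merge(String[][] judgements) {
            int len = judgements[0].length;
            String[] merged = new String[len];
            for (int i = 0; i < len; i++) {
                Map<String, Integer> votes = new HashMap<String, Integer>();
                for (String[] judgement : judgements) {
                    Integer n = votes.get(judgement[i]);
                    votes.put(judgement[i], n == null ? 1 : n + 1);
                }
                for (Map.Entry<String, Integer> e : votes.entrySet()) {
                    if (e.getValue() * 2 > judgements.length) {  // strict majority
                        merged[i] = e.getKey();
                    }
                }
            }
            return merged;   // null entries flag disagreement for review
        }
    }

With triple judgements a tag wins as soon as two annotators agree, and any
token left null would go straight to a disagreement queue for an expert to
resolve.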

Both can be assisted by some previously trained model to help keep people awake and on track :-}  I think you mentioned in a prior mail that you can even use the models that were built with proprietary data to bootstrap the assistance process.

These are two ends of the spectrum: one assumes you are using people with limited language skills, the other people who are potentially much more competent.  For one you probably need to gather data from many more people, for the other from far fewer.  Personally I like the crowdsourced approach, but I wonder if OpenNLP could find enough language "experts" per language that it would make better sense to build a non web based app that perhaps is a little more expedient to operate.

For giggles, assuming we needed to generate labels:
60k lines of text
average words per line == 11
number of judgements == 3

We would be collecting almost 2M judgements (60,000 x 11 x 3 = 1,980,000) from people, which we would reassemble into our training data after throwing out the bath water.

Maybe in the competent language expert case we only get each sentence judged once, by one person.  There is then perhaps no labeled sentence to be re-assembled, but we may want to keep people's judgements separate so we could validate their work against others'.

The data processing pipeline looks somewhat different in each case. The competent POS labeler case simplifies the process greatly for the pipeline.

I would love to help in whatever way I can, and can also find people to help label data at my company's own expense to help accelerate this.

Best

C



Re: OpenNLP Annotations Proposal

Posted by Jörn Kottmann <ko...@gmail.com>.
On 6/7/11 4:36 PM, Tommaso Teofili wrote:
> Hello Jörn,
> did you know about the Open Relevance project [1], maybe we can ask them to
> join forces (even if I don't see so much activity on their MLs).
> However I'd be happy to contribute in such an effort.

As I understand it, they are collecting massive amounts of free text,
whereas our focus is to manually annotate enough text to train our components.

Maybe it is a good source of new text when we are done with Wikinews,
or if we want to annotate text of a different nature, e.g. blogs,
emails, etc.

Jörn

Re: OpenNLP Annotations Proposal

Posted by Tommaso Teofili <to...@gmail.com>.
Hello Jörn,
did you know about the Open Relevance project [1]? Maybe we can ask them to
join forces (even if I don't see much activity on their MLs).
In any case I'd be happy to contribute to such an effort.
Tommaso

[1] : http://lucene.apache.org/openrelevance/
