Posted to user@uima.apache.org by Johann Petrak <jo...@gmail.com> on 2021/02/18 14:21:22 UTC

UIMA Conventions for certain NLP constructs?

Hi all,

I hope this is the right forum to ask these questions; if not, please excuse
my mistake and point me to the right forum or source.

I am currently looking into common conventions for how NLP tools represent
certain common concepts, and I would like to learn whether there are
standards, definitions or conventions for how this is done by UIMA
annotators. I have to admit that I have never really worked with UIMA and
that my knowledge of how things work in UIMA is limited, so please excuse me
if I am not using the right terminology.

What I am most interested in are the following aspects of representing
NLP-related concepts with stand-off annotations. I would be extremely glad if
somebody could give me a rough explanation of how UIMA handles these, or
point me to where in the documentation it would be best to look:

* multi-word tokens and their features: I guess that most UIMA processing
  pipelines will start off with some kind of tokenization where token or word
  annotations (and their offset ranges) are created. But how are multi-word
  tokens, e.g. Spanish "vámonos" = "vamos", "nos", and subsequently properties
  of the words, e.g. POS, lemma ("ir", "nosotros"), handled? While the
  multi-word token itself obviously can be associated with an offset range,
  the words for that token cannot, so how are they annotated?
* how are dependency trees or constituency parses represented? Is there a
  specific data structure just for each of those, or for trees or graphs with
  annotations as leaves in general?
  Similarly, is there a convention for how to represent coreference chains?
* Is there a convention for how to represent cross-document coreferences?
* Is there a convention for how to represent parallel documents and map
  between annotations in parallel texts or represent word alignments?
* How are multilingual documents handled, where different parts of the
  document, maybe even just parts of a sentence, switch language and thus may
  need to be processed differently? Is there a convention for representing
  such switches in language and for how to deal with this?
* How does UIMA handle documents from corpora that only contain token
  sequences but no whitespace (e.g. the original CoNLL corpora)?

Any information about this, or about how to find out about these things in
the documentation, would be extremely welcome.

Many thanks and all the best,
  Johann Petrak

---
http://johann-petrak.github.io/

Re: UIMA Conventions for certain NLP constructs?

Posted by Johann Petrak <jo...@gmail.com>.
Thanks a lot for your detailed reply, again extremely useful and
interesting!
(I have put some responses inline)

On Thu, 18 Feb 2021 at 22:14, Richard Eckart de Castilho <re...@apache.org>
wrote:

>
> There is GATE of course ;) Although as far as I understood, GATE like UIMA
> intentionally does not prescribe a particular annotation schema / encoding
> so that it remains as flexible as possible to its users.
>

Yes - I have been both a user and a developer of GATE, and from my own
personal POV this flexibility is both a blessing and a curse. In addition,
GATE supports neither an ordering of annotations over the same span nor
zero-length annotations properly, so one would have to deal with this too.
This is why the new Python GateNLP package supports everything necessary to
do this (ordered annotations over the same span and proper support for
zero-length annotations) plus a pre-implemented convention for how to
represent MWTs: the words are represented as annotations over evenly divided
character ranges within the token, and if there are more words than the token
has characters, the last words which do not fit become ordered zero-length
annotations.

This is not enforced but treated as a useful convention -- the more tools
follow the convention, the easier they will be able to interoperate.
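
To make the convention concrete, here is a minimal sketch of the span-splitting
arithmetic in plain Java (this is not the actual GateNLP API, which is Python;
the method name and layout are made up for illustration):

    public class MwtSpanSplit {
        /**
         * Divide a multi-word token's character range evenly among its words;
         * if there are more words than characters, the trailing words that no
         * longer fit become ordered zero-length spans at the end offset.
         */
        static int[][] splitSpans(int begin, int end, int numWords) {
            int length = end - begin;
            int[][] spans = new int[numWords][2];
            if (numWords <= length) {
                int step = length / numWords;
                for (int i = 0; i < numWords; i++) {
                    spans[i][0] = begin + i * step;
                    spans[i][1] = (i == numWords - 1) ? end : begin + (i + 1) * step;
                }
            } else {
                for (int i = 0; i < numWords; i++) {
                    // one character per word while they last, then zero-length spans
                    spans[i][0] = Math.min(begin + i, end);
                    spans[i][1] = Math.min(begin + i + 1, end);
                }
            }
            return spans;
        }

        public static void main(String[] args) {
            // "vámonos" at offsets 0-7 split into two words -> [0,3] and [3,7]
            for (int[] s : splitSpans(0, 7, 2)) {
                System.out.println(s[0] + "-" + s[1]);
            }
        }
    }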


> Nancy Ide [1] has done a lot of work on interoperability in the NLP
> space. One of the recent projects she is involved in is the LAPPS Grid [2]
> which includes a JSON-based data format, a schema, and a whole processing
> platform including components. The LAPPS Grid also integrates third-party
> components such as GATE or DKPro Core.
>

This is a very interesting pointer, thank you!


>
> In Germany, there is the Weblicht [3] platform of CLARIN-D. They have the
> XML-based TCF format for representing their stuff.
>
> In the Netherlands, there is CLARIAH [4]. They have the XML-based FoLiA
> and a lot of stuff building on that, e.g. CLAM [6].
>
> From the semantic web space, there is the RDF-based NIF [7].
>
> ... and these are just the ones I remember off the top of my head.
>
> If you follow these references and do a bit of digging, you will probably find
> much more.
>
> However, doing a fine-grained comparison between all of these to distill
> commonalities
> and differences is quite a daunting task. Been there, done that - as you
> say - that is
> a place few people dare to venture.
>
>
Thanks for all the pointers - you are right, it is quite daunting!

All the best,
  Johann



> Cheers,
>
> -- Richard
>
> [1] https://scholar.google.de/citations?hl=de&user=WkfhlGkAAAAJ
> [2] https://www.lappsgrid.org
> [3] https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page
> [4] https://www.clariah.nl
> [5] https://pypi.org/project/FoLiA/
> [6] https://clam.readthedocs.io/en/latest/installation.html
> [7]
> https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html

Re: UIMA Conventions for certain NLP constructs?

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi Johann,

I think there is quite a bit of work that you can look at.

Besides all the third-party UIMA libraries (DKPro Core, ClearTK, cTAKES, JCoRe, etc.),
there is a lot more out there.

There is GATE of course ;) Although as far as I understood, GATE, like UIMA,
intentionally does not prescribe a particular annotation schema / encoding,
so that it remains as flexible as possible for its users.

Nancy Ide [1] has done a lot of work on interoperability in the NLP
space. One of the recent projects she is involved in is the LAPPS Grid [2],
which includes a JSON-based data format, a schema, and a whole processing
platform including components. The LAPPS Grid also integrates third-party
components such as GATE or DKPro Core.

In Germany, there is the Weblicht [3] platform of CLARIN-D. They have the
XML-based TCF format for representing their stuff.

In the Netherlands, there is CLARIAH [4]. They have the XML-based FoLiA [5]
and a lot of stuff building on that, e.g. CLAM [6].

From the semantic web space, there is the RDF-based NIF [7].

... and these are just the ones I remember off the top of my head.

If you follow these references and do a bit of digging, you will probably find much more.

However, doing a fine-grained comparison between all of these to distill commonalities
and differences is quite a daunting task. Been there, done that - as you say - that is
a place few people dare to venture.

Cheers,

-- Richard

[1] https://scholar.google.de/citations?hl=de&user=WkfhlGkAAAAJ
[2] https://www.lappsgrid.org
[3] https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page
[4] https://www.clariah.nl
[5] https://pypi.org/project/FoLiA/
[6] https://clam.readthedocs.io/en/latest/installation.html
[7] https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html

Re: UIMA Conventions for certain NLP constructs?

Posted by Johann Petrak <jo...@gmail.com>.
Hi Richard,

thanks a lot for that detailed reply, this is very helpful!

TBH I was not really aware of those third-party tools, thanks for mentioning
them!

Your detailed responses to all the questions are very helpful and interesting,
and in part they match the approaches I have considered / thought of.
The reason why I am interested is that the number of tools, services and also
resources for NLP is growing immensely, but there is rarely an attempt to make
them work with each other, and often it is not easy to even find out how a
specific tool handles these things.
So I am trying to find out whether, for some of these problems, a common
approach has evolved, or something that could be used as the "greatest common
divisor".

Thanks again and all the best,

  Johann

---
http://johann-petrak.github.io/

On Thu, 18 Feb 2021 at 15:43, Richard Eckart de Castilho <re...@apache.org>
wrote:

> Hi Johann,
>
> the UIMA framework itself does not define how such linguistic concepts
> are modelled. What it does is offering a framework within which the
> concepts
> can be modeled without prescribing a particular way.
>
> There are various third parties that provide so-called "type systems".
> These type systems then specify how certain phenomena are represented.
> Usually, these type systems are part of a library of UIMA components
> coming from the same third party.
>
> A non-exhaustive list of such third parties is:
>
> - ClearTK
> - JCoRe
> - cTAKES
> - ...
> - and DKPro Core (which btw. I am maintaining - please excuse if I limit
> examples below to DKPro Core - but going into all type systems would be a
> thesis. If you really want, I can point you to mine which has a chapter on
> type system design...)
>
> > * multi-word tokens and their features: I guess that most UIMA processing
> > pipelines will start off with some kind of tokenization where token or word
> > annotations (and their offset ranges) are created. But how are multi-word
> > tokens, e.g. Spanish "vámonos" = "vamos", "nos" and subsequently properties
> > of the words e.g. POS, lemma ("ir", "nosotros") handled? While the multiword
> > token itself obviously can be associated with an offset range, the words for
> > that token cannot, so how are they annotated?
>
> Difficult one. DKPro Core offers different ways this could be modeled.
> For example, we introduced an "order" feature on the token that allows
> multiple tokens to share the same position but defines an order in which
> the
> tokens should be processed:
>
> - https://github.com/dkpro/dkpro-core/issues/1152
>
> Related to that is also the "form" feature because instead of the actual
> text,
> processing should maybe happen on a normalized form of the token:
>
> - https://github.com/dkpro/dkpro-core/issues/953
>
> > * how are dependency trees or constituency parses represented? Is there a
> > specific data structure just for each of those or for trees or graphs with
> > annotations as leaves in general?
>
> The UIMA CAS is essentially an object graph - a tree can be easily
> modelled.
> However, there is no built-in "tree" type. Here is how DKPro Core does it:
>
>
> https://dkpro.github.io/dkpro-core/releases/2.1.0/docs/typesystem-reference.html#_syntax
>
> >  Similarly, is there a convention for how to represent coreference
> > chains?
>
> Again, here is how DKPro Core does it. The approach is also used by the
> annotation tools INCEpTION and WebAnno which I work on.
>
>
> https://dkpro.github.io/dkpro-core/releases/2.1.0/docs/typesystem-reference.html#_coreference
>
> > * Is there a convention for how to represent cross-document coreferences?
>
> One way of doing it is via a shared identifier - e.g. an identifier from a
> knowledge resource such as Wikidata (e.g. INCEpTION does that).
>
> Otherwise, you could come up with your own convention such as combining a
> URL with some offset information. The W3C Web Annotation standard has a
> nice
> overview of different ways of modelling reference targets.
>
> > * Is there a convention for how to represent parallel documents and map
> > between annotations in parallel texts or represent word alignments?
>
> You can use cross-document links.
>
> You can also use the concept of a "view" in UIMA to pair your documents up.
> Then you can define a custom feature structure which has pointers to both
> views (i.e. both text versions).
>
> > * How are multilingual documents handled, where different parts of the
> > document, maybe even just parts of a sentence switch language and thus may
> > need to get processed differently?
> > Is there a convention for representing such switches in language and for
> > how to deal with this?
>
> That is pretty specific to how a particular library of UIMA processing
> components
> is implemented. Some may have a way of specifying a language for portions
> of a
> text. Others might expect the user to break the text up into segments
> containing
> only one language and to process them separately.
>
> > * How does UIMA handle documents from corpora that only contain tokens
> > sequences but not any whitespace (e.g. original Conll corpora)?
>
> UIMA itself does not worry about that. The DKPro Core library includes
> readers
> for different kinds of formats including CoNLL-U. The readers try to make a
> reasonable choice for whitespace handling depending on the format. E.g. for
> most CoNLL formats, we would introduce a space between tokens and a line
> break
> between sentences.
>
> The CoNLL-U format includes metadata as to where spaces should be added
> and the
> DKPro Core ConnluReader tries to honor this information.
>
> > Any information about this or about how to find out about these things in
> > the documentation would be extremely welcome.
>
> I'm afraid, you won't find such information on the UIMA website. You'd
> need to
> turn to the websites of the different third-party libraries and to papers
> the
> authors of these libraries may have written.
>
> Cheers,
>
> -- Richard
>
>
>

Re: UIMA Conventions for certain NLP constructs?

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi Johann,

the UIMA framework itself does not define how such linguistic concepts
are modelled. What it does is offer a framework within which the concepts
can be modelled without prescribing a particular way.

There are various third parties that provide so-called "type systems".
These type systems then specify how certain phenomena are represented.
Usually, these type systems are part of a library of UIMA components
coming from the same third party.

A non-exhaustive list of such third parties is:

- ClearTK
- JCoRe
- cTAKES
- ...
- and DKPro Core (which btw. I am maintaining - please excuse if I limit
examples below to DKPro Core - but going into all type systems would be a
thesis. If you really want, I can point you to mine which has a chapter on
type system design...)

> * multi-word tokens and their features: I guess that most UIMA processing
> pipelines will start off with some kind of tokenization where token or word
> annotations (and their offset ranges) are created. But how are multi-word tokens,
> e.g. Spanish "vámonos" = "vamos", "nos" and subsequently properties of the
> words e.g. POS, lemma ("ir", "nosotros") handled? While the multiword token
> itself obviously can be associated with an offset range, the words for that
> token cannot, so how are they annotated?

Difficult one. DKPro Core offers different ways this could be modeled.
For example, we introduced an "order" feature on the token that allows
multiple tokens to share the same position but defines an order in which the
tokens should be processed:

- https://github.com/dkpro/dkpro-core/issues/1152

Related to that is also the "form" feature: instead of the actual text,
processing may need to happen on a normalized form of the token:

- https://github.com/dkpro/dkpro-core/issues/953
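
As a rough illustration of these two features (plain UIMA CAS API with a
made-up type, not DKPro Core's actual Token type), two word-level tokens can
share the span of the multi-word token, be kept apart by "order", and carry
their normalized "form":

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.metadata.TypeDescription;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;

    public class MwtSketch {
        public static void main(String[] args) throws Exception {
            // Illustrative custom type; DKPro Core's real Token type lives in
            // its own type system descriptors.
            TypeSystemDescription tsd =
                    UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
            TypeDescription tok = tsd.addType("example.Token", "", CAS.TYPE_NAME_ANNOTATION);
            tok.addFeature("order", "processing order for tokens sharing a span",
                    CAS.TYPE_NAME_INTEGER);
            tok.addFeature("form", "normalized surface form", CAS.TYPE_NAME_STRING);

            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            cas.setDocumentText("vámonos");

            Type tokenType = cas.getTypeSystem().getType("example.Token");
            Feature orderF = tokenType.getFeatureByBaseName("order");
            Feature formF = tokenType.getFeatureByBaseName("form");

            // Two word-level tokens covering the same range as the MWT "vámonos",
            // distinguished by "order" and carrying a normalized "form".
            AnnotationFS vamos = cas.createAnnotation(tokenType, 0, 7);
            vamos.setIntValue(orderF, 0);
            vamos.setStringValue(formF, "vamos");
            cas.addFsToIndexes(vamos);

            AnnotationFS nos = cas.createAnnotation(tokenType, 0, 7);
            nos.setIntValue(orderF, 1);
            nos.setStringValue(formF, "nos");
            cas.addFsToIndexes(nos);
        }
    }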

> * how are dependency trees or constituency parses represented? Is there a
> specific data structure just for each of those or for trees or graphs with
> annotations as leaves in general?

The UIMA CAS is essentially an object graph - a tree can be easily modelled.
However, there is no built-in "tree" type. Here is how DKPro Core does it:

https://dkpro.github.io/dkpro-core/releases/2.1.0/docs/typesystem-reference.html#_syntax
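
For instance, a dependency edge can simply be a feature structure whose
features point at the governor and dependent tokens. A minimal sketch with
made-up types (DKPro Core's actual Dependency type is defined in the type
system linked above):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.metadata.TypeDescription;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;

    public class DependencySketch {
        public static void main(String[] args) throws Exception {
            // Illustrative types only, not the DKPro Core type names.
            TypeSystemDescription tsd =
                    UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
            tsd.addType("example.Token", "", CAS.TYPE_NAME_ANNOTATION);
            TypeDescription dep = tsd.addType("example.Dependency", "", CAS.TYPE_NAME_ANNOTATION);
            dep.addFeature("governor", "head token", "example.Token");
            dep.addFeature("dependent", "dependent token", "example.Token");
            dep.addFeature("relation", "dependency label", CAS.TYPE_NAME_STRING);

            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            cas.setDocumentText("dogs bark");

            Type tokenT = cas.getTypeSystem().getType("example.Token");
            Type depT = cas.getTypeSystem().getType("example.Dependency");

            AnnotationFS dogs = cas.createAnnotation(tokenT, 0, 4);
            AnnotationFS bark = cas.createAnnotation(tokenT, 5, 9);
            cas.addFsToIndexes(dogs);
            cas.addFsToIndexes(bark);

            // One edge of the dependency graph: "dogs" is the nsubj of "bark".
            AnnotationFS edge = cas.createAnnotation(depT, 0, 4);
            edge.setFeatureValue(depT.getFeatureByBaseName("governor"), bark);
            edge.setFeatureValue(depT.getFeatureByBaseName("dependent"), dogs);
            edge.setStringValue(depT.getFeatureByBaseName("relation"), "nsubj");
            cas.addFsToIndexes(edge);
        }
    }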

>  Similarly, is there a convention for how to represent coreference chains?

Again, here is how DKPro Core does it. The approach is also used by the
annotation tools INCEpTION and WebAnno which I work on.

https://dkpro.github.io/dkpro-core/releases/2.1.0/docs/typesystem-reference.html#_coreference
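
Roughly, that representation is a linked list of mention annotations plus a
chain object pointing at the first mention. A sketch with illustrative types
(not the actual DKPro Core type names):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.metadata.TypeDescription;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;

    public class CorefSketch {
        public static void main(String[] args) throws Exception {
            TypeSystemDescription tsd =
                    UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
            TypeDescription link = tsd.addType("example.CorefLink", "", CAS.TYPE_NAME_ANNOTATION);
            link.addFeature("next", "next mention in the chain", "example.CorefLink");
            TypeDescription chain = tsd.addType("example.CorefChain", "", CAS.TYPE_NAME_TOP);
            chain.addFeature("first", "first mention in the chain", "example.CorefLink");

            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            cas.setDocumentText("Anna came home. She was tired.");

            Type linkT = cas.getTypeSystem().getType("example.CorefLink");
            Type chainT = cas.getTypeSystem().getType("example.CorefChain");

            AnnotationFS anna = cas.createAnnotation(linkT, 0, 4);   // "Anna"
            AnnotationFS she = cas.createAnnotation(linkT, 16, 19);  // "She"
            anna.setFeatureValue(linkT.getFeatureByBaseName("next"), she);
            cas.addFsToIndexes(anna);
            cas.addFsToIndexes(she);

            // The chain itself is a plain feature structure (no offsets) pointing
            // at the first mention; the rest is reachable via "next".
            FeatureStructure corefChain = cas.createFS(chainT);
            corefChain.setFeatureValue(chainT.getFeatureByBaseName("first"), anna);
            cas.addFsToIndexes(corefChain);
        }
    }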

> * Is there a convention for how to represent cross-document coreferences?

One way of doing it is via a shared identifier - e.g. an identifier from a
knowledge resource such as Wikidata (e.g. INCEpTION does that).

Otherwise, you could come up with your own convention, such as combining a
URL with some offset information. The W3C Web Annotation standard has a nice
overview of different ways of modelling reference targets.
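
For the shared-identifier approach, this can be as simple as a string feature
on the mention annotation holding the knowledge-base ID. A small sketch,
assuming a custom "example.Mention" type with a "kbId" string feature (both
names made up for illustration):

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;

    public class CrossDocIdSketch {
        /**
         * Marks a mention with a knowledge-base identifier. Assumes the CAS was
         * created with an "example.Mention" annotation type that has a string
         * feature "kbId". Mentions in different documents carrying the same kbId
         * (e.g. a Wikidata QID) are then implicitly coreferent across documents.
         */
        static void markMention(CAS cas, int begin, int end, String wikidataId) {
            Type mentionT = cas.getTypeSystem().getType("example.Mention");
            AnnotationFS mention = cas.createAnnotation(mentionT, begin, end);
            mention.setStringValue(mentionT.getFeatureByBaseName("kbId"), wikidataId);
            cas.addFsToIndexes(mention);
        }
    }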

> * Is there a convention for how to represent parallel documents and map
> between annotations in parallel texts or represent word alignments?

You can use cross-document links.

You can also use the concept of a "view" in UIMA to pair your documents up.
Then you can define a custom feature structure which has pointers to both
views (i.e. both text versions).
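
A sketch of the view-based pairing with the plain CAS API (the "Alignment"
type and the view names are made up for this example):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.metadata.TypeDescription;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;

    public class AlignmentSketch {
        public static void main(String[] args) throws Exception {
            // A plain feature structure pointing at one annotation in each view.
            TypeSystemDescription tsd =
                    UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
            tsd.addType("example.Token", "", CAS.TYPE_NAME_ANNOTATION);
            TypeDescription align = tsd.addType("example.Alignment", "", CAS.TYPE_NAME_TOP);
            align.addFeature("source", "token in the source-language view", "example.Token");
            align.addFeature("target", "token in the target-language view", "example.Token");

            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            CAS en = cas.createView("english");
            CAS de = cas.createView("german");
            en.setDocumentText("house");
            de.setDocumentText("Haus");

            Type tokenT = cas.getTypeSystem().getType("example.Token");
            AnnotationFS enTok = en.createAnnotation(tokenT, 0, 5);
            AnnotationFS deTok = de.createAnnotation(tokenT, 0, 4);
            en.addFsToIndexes(enTok);
            de.addFsToIndexes(deTok);

            // The alignment links annotations from both views of the same CAS.
            Type alignT = cas.getTypeSystem().getType("example.Alignment");
            FeatureStructure word = cas.createFS(alignT);
            word.setFeatureValue(alignT.getFeatureByBaseName("source"), enTok);
            word.setFeatureValue(alignT.getFeatureByBaseName("target"), deTok);
            cas.addFsToIndexes(word);
        }
    }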

> * How are multilingual documents handled, where different parts of the
> document, maybe even just parts of a sentence switch language and thus may need to get
> processed differently?
> Is there a convention for representing such switches in language  and for
> how to deal with this?

That is pretty specific to how a particular library of UIMA processing components
is implemented. Some may have a way of specifying a language for portions of a
text. Others might expect the user to break the text up into segments containing
only one language and to process them separately.
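
If a component library does support per-span languages, one conceivable
representation is an annotation carrying a language code that downstream
components consult before processing that region. A sketch with a made-up
type (nothing standardized):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.metadata.TypeDescription;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;

    public class LanguageSpanSketch {
        public static void main(String[] args) throws Exception {
            TypeSystemDescription tsd =
                    UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
            TypeDescription lang = tsd.addType("example.LanguageSpan", "", CAS.TYPE_NAME_ANNOTATION);
            lang.addFeature("language", "ISO 639-1 code for this span", CAS.TYPE_NAME_STRING);

            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            cas.setDocumentText("He said: vámonos!");

            Type langT = cas.getTypeSystem().getType("example.LanguageSpan");
            AnnotationFS en = cas.createAnnotation(langT, 0, 9);   // "He said: "
            en.setStringValue(langT.getFeatureByBaseName("language"), "en");
            cas.addFsToIndexes(en);

            AnnotationFS es = cas.createAnnotation(langT, 9, 17);  // "vámonos!"
            es.setStringValue(langT.getFeatureByBaseName("language"), "es");
            cas.addFsToIndexes(es);
        }
    }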

> * How does UIMA handle documents from corpora that only contain tokens
> sequences but not any whitespace (e.g. original Conll corpora)?

UIMA itself does not worry about that. The DKPro Core library includes readers
for different kinds of formats including CoNLL-U. The readers try to make a
reasonable choice for whitespace handling depending on the format. E.g. for
most CoNLL formats, we would introduce a space between tokens and a line break
between sentences.

The CoNLL-U format includes metadata indicating where spaces should be added, and the
DKPro Core ConllUReader tries to honor this information.
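
As an illustration of the kind of decision such a reader has to make, here is
a small sketch (not the actual DKPro Core reader code) that rebuilds a
document text from CoNLL-U-style tokens, adding a space after each token
unless the corpus said "SpaceAfter=No":

    import java.util.List;

    public class DetokenizeSketch {
        /**
         * Rebuilds a document text from token forms, inserting a space after each
         * token unless its MISC column carried "SpaceAfter=No" - roughly the
         * choice a reader must make when the source corpus stores no whitespace.
         */
        static String rebuildText(List<String> forms, List<Boolean> spaceAfter) {
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < forms.size(); i++) {
                text.append(forms.get(i));
                boolean last = (i == forms.size() - 1);
                if (!last && spaceAfter.get(i)) {
                    text.append(' ');
                }
            }
            return text.toString();
        }

        public static void main(String[] args) {
            // "vámonos" followed by "!" with SpaceAfter=No between them
            System.out.println(rebuildText(
                    List.of("vámonos", "!"),
                    List.of(false, true)));  // prints: vámonos!
        }
    }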

> Any information about this or about how to find out about these things in
> the documentation would be extremely welcome.

I'm afraid, you won't find such information on the UIMA website. You'd need to
turn to the websites of the different third-party libraries and to papers the 
authors of these libraries may have written.

Cheers,

-- Richard