Posted to users@opennlp.apache.org by Joe Corneli <ho...@gmail.com> on 2016/01/14 16:39:52 UTC

multiple pos and alternate parses (clojure interface)

I'm using the OpenNLP Clojure interface,
https://github.com/dakrone/clojure-opennlp

In my first attempt at parsing a sentence with the Treebank model, I
tried the following:

(treebank-parser ["What can happen in a second ."])

I got the following answer:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

For the most part that seems OK, except that "second" is tagged as an
adjective (JJ) rather than as a noun (NN).
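To make the tagging easier to inspect, the (tag word) leaves can be pulled out of the bracketed output. The following is only a minimal sketch in plain Java: it does not use OpenNLP at all, the names TreeLeaves and leafTags are made up for illustration, and it assumes the tree is available as a string in the bracketed form shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TreeLeaves {
    // Matches a leaf such as "(JJ second)": a tag, whitespace, then a word,
    // with no nested parentheses inside.
    private static final Pattern LEAF =
        Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns "word/TAG" strings for each leaf, in sentence order. */
    public static List<String> leafTags(String tree) {
        List<String> out = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) {
            out.add(m.group(2) + "/" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String tree = "(TOP (SBARQ (WHNP (WP What)) (SQ (VP (MD can) (VP (VB happen) "
                    + "(PP (IN in) (NP (DT a) (JJ second)))))) (. .)))";
        // Prints: What/WP can/MD happen/VB in/IN a/DT second/JJ ./.
        System.out.println(String.join(" ", leafTags(tree)));
    }
}
```

This only reads the single parse that was returned; it does not by itself produce alternative taggings.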

[I'm certainly no linguist, but is it even meaningful to talk about an NP
without a noun in it?]

Anyway, at a technical level, I wonder how I can get the parser (or
tagger) to notice and show me the alternative possibilities (i.e. where
"second" is understood as a noun)?

From looking around online, I'm pretty sure this is possible, though I
don't know if it's directly supported by the Clojure interface!  I'd
also appreciate any pointers to how to do it directly in Java, so I know
what sorts of questions to ask next.

Many thanks,

Joe

PS. The issue of indeterminacy is described in "Building a large
annotated corpus of English: the Penn Treebank" as follows:

 «Since a major concern of the Treebank is to avoid requiring annotators to
 make arbitrary decisions, we allow words to be associated with more than
 one POS tag. Such multiple tagging indicates either that the word's part
 of speech simply cannot be decided or that the annotator is unsure which
 of the alternative tags is the correct one. In principle, annotators can
 tag a word with any number of tags, but in practice, multiple tags are
 restricted to a small number of recurring two-tag combinations: JJ|NN
 (adjective or noun as prenominal modifier), JJ|VBG (adjective or
 gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG
 (noun or gerund), and RB|RP (adverb or particle).»

  - https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html