Posted to users@opennlp.apache.org by Phillip Rhodes <mo...@gmail.com> on 2021/11/24 19:59:28 UTC

Question about format / annotations in parser training data?

OpenNLP team:

I'm trying to work out how to train a Parser model with OpenNLP. I see that
I need to acquire a body of training data in OpenNLP format, which the docs
suggest is basically Penn Treebank format, with one sentence per line. OK,
this part is fine. The rub is, the "real" PTB data is hidden away by the
gatekeeping / rent-seeking Linguistic Data Consortium and for my purposes
is effectively unavailable.
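For concreteness, the command I'm expecting to feed this data to is the
ParserTrainer tool. The flags below are from my reading of the 1.x manual, so
double-check them against `opennlp ParserTrainer -help` on your version; the
file names are just placeholders:

```shell
# Train a chunking parser from one-sentence-per-line PTB-style data.
# train.parse and head_rules are placeholder file names.
opennlp ParserTrainer -model en-parser-chunking.bin -parserType CHUNKING \
    -head-rules head_rules -lang en -data train.parse -encoding UTF-8
```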

"Fine", I find myself thinking, I can get some data elsewhere, even if it
means annotating my own from raw text. Or maybe I can borrow bits from
something like the Treebank Semantics Parsed Corpus. But here's where
my question comes in:

In the couple of examples of the training data format in the OpenNLP docs,
we see stuff like this:

(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))
(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))

In the manual, we are referred to <
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html>
for info on the annotations, but by and large what is shown on this page
does not match the supplied example. For example, there is no mention of
annotations "TOP", "S", "NP-SBJ", and so on. This leads me to wonder
whether I can locate useful pre-existing data, or transform data
programmatically, into something I can then use with OpenNLP.

Is there anything I can look at (besides digging into the source for the
Parser, which I will resort to if it comes to that) to help me understand
exactly what the training data needs to look like? Maybe a slightly larger
sample of a known good training data file?
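In the meantime, the closest I've come is mechanically checking candidate
lines against the shape of the two examples above. Here's the throwaway
Python I'm using for that, in case it's useful; the helper names and the
regex are my own invention, nothing from OpenNLP itself:

```python
import re

def balanced(tree_line):
    """True if the parentheses in a one-line bracketing are balanced."""
    depth = 0
    for ch in tree_line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '('
                return False
    return depth == 0

def leaves(tree_line):
    """Extract (POS-tag, token) pairs from the leaf nodes, i.e. the
    innermost "(TAG token)" groups with nothing nested inside them."""
    return re.findall(r"\(([^()\s]+)\s+([^()\s]+)\s*\)", tree_line)

sample = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))"
print(balanced(sample))  # True
print(leaves(sample))    # [('DT', 'Some'), ('VBP', 'say'), ('NNP', 'November'), ('.', '.')]
```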

The thing is, I don't need much data, because my real goal is not a
complete parser for generic English. I want something about half a step
above "toy model" just so I can do experiments with the mapping from the
syntactically parsed text to my notions of a corresponding semantic model.


Thanks for any and all help!


Phil
~~~
This message optimized for indexing by NSA PRISM

Re: Question about format / annotations in parser training data?

Posted by Phillip Rhodes <mo...@gmail.com>.
On Fri, Nov 26, 2021 at 8:08 AM Rodrigo Agerri <ro...@ehu.eus>
wrote:

> There might be an open version of the penn treebank somewhere. Still, you
> can use the GUM corpus, which I think is freely available:
>
> https://corpling.uis.georgetown.edu/gum/annotations.html#const
>

Awesome, thanks. I wasn't aware of the GUM corpus before now.




> Regarding the other question, what you need is the description of the
> constituents and function tags:
>
>
> http://surdeanu.cs.arizona.edu//mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
>
>
Perfect. Thanks!


Phil

Re: Question about format / annotations in parser training data?

Posted by Rodrigo Agerri <ro...@ehu.eus>.
Hello,

There might be an open version of the penn treebank somewhere. Still, you
can use the GUM corpus, which I think is freely available:

https://corpling.uis.georgetown.edu/gum/annotations.html#const

For other languages, you could use, for example, the Evalita 2011 corpus
(Italian), the Alpino corpus (Dutch), or AnCora (Spanish and Catalan); they
are publicly available. There is also Negra for German, and there might be
others I cannot think of right now.

Regarding the other question, what you need is the description of the
constituents and function tags:

http://surdeanu.cs.arizona.edu//mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
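One more note: if whatever corpus you end up with carries function tags you
do not want, it is easy to normalise labels like NP-SBJ down to bare NP
before training. A quick regex sketch; I have only checked it against the
example lines from your mail:

```python
import re

def strip_function_tags(line):
    """Rewrite non-terminal labels like NP-SBJ or NP-TMP=2 to bare NP.
    Only constituent labels carry -FUNC/=index suffixes in this data,
    so matching LABEL followed by '-' or '=' right after '(' is enough."""
    return re.sub(r"\(([A-Z]+)[-=][-=A-Z0-9$]+", r"(\1", line)

sample = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))"
print(strip_function_tags(sample))
# (TOP (S (NP (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))
```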

Best,

R

On Wed, 24 Nov 2021 at 20:59, Phillip Rhodes <mo...@gmail.com>
wrote:

> [quoted copy of the original message elided]