
Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/12/05 10:19:01 UTC

Lemmatizer BUG

Hello,
I am running some tests with LemmatizerME.
It returns a wrong word, one that never occurs in the training
data. Basically, it is NOT an Italian word :)

The output is:

[O, O, O, O, *R1trR0ae*]

The code:

        try (InputStream in = new FileInputStream("/home/damiano/lemmas.bin")) {
            LemmatizerModel lemmatizerModel = new LemmatizerModel(in);

            LemmatizerME lem = new LemmatizerME(lemmatizerModel);

            String[] tokens = new String[] {
                "ultimo", "capitolo", "della", "saga", "iniziata"
            };

            String[] pos = new String[] {
                "As", "Ss", "EA", "Ss", "Vp"
            };

            System.out.println(Arrays.toString(lem.lemmatize(tokens, pos)));
        }

How can I analyze what happened?

Thanks
Damiano

Re: Lemmatizer BUG

Posted by Damiano Porta <da...@gmail.com>.

Perfect! Thank you!



Re: Lemmatizer BUG

Posted by Rodrigo Agerri <ro...@ehu.eus>.

Hello,

The javadoc says that the implementation of the statistical lemmatizer is
based on:

http://grzegorz.chrupala.me/papers/phd-single.pdf

Check Chapter 6.

This paper gives a good summary of that chapter:

http://grzegorz.chrupala.me/papers/chrupala-etal-2008a/paper.pdf

To cut a long story short, the statistical lemmatizer does not learn the
lemmas themselves, but automatically induced classes that encode the
permutations required to go from the word form to the lemma. This is because
learning over those permutation classes generalizes much better than learning
the lemmas themselves (e.g., many word-lemma pairs are captured by the same
permutation class).
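
To make the generalization point concrete, here is a minimal sketch that
derives a replace-only class for a few word-lemma pairs and groups the words
by class. It is an illustration under assumptions, not OpenNLP's actual
induction code: the class LemmaClassGrouping and its methods are hypothetical,
only equal-length pairs are handled, the operations are emitted from the
highest position down simply to match the R1trR0ae example in this thread, and
the pairs other than iniziata/iniziare are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of OpenNLP.
class LemmaClassGrouping {

    // Builds a replace-only class by comparing the reversed word and lemma.
    // Assumption: both strings have the same length (no inserts or deletes).
    static String replaceOnlyClass(String word, String lemma) {
        if (word.length() != lemma.length()) {
            throw new IllegalArgumentException("sketch handles equal-length pairs only");
        }
        String w = new StringBuilder(word).reverse().toString();
        String l = new StringBuilder(lemma).reverse().toString();
        StringBuilder ops = new StringBuilder();
        // Walk from the highest position down, as in the thread's example.
        for (int i = w.length() - 1; i >= 0; i--) {
            if (w.charAt(i) != l.charAt(i)) {
                ops.append('R').append(i).append(w.charAt(i)).append(l.charAt(i));
            }
        }
        // "O" means word form and lemma are the same string.
        return ops.length() == 0 ? "O" : ops.toString();
    }

    // Groups word forms by the class that maps them to their lemma.
    static Map<String, List<String>> groupByClass(String[][] pairs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] pair : pairs) {
            groups.computeIfAbsent(replaceOnlyClass(pair[0], pair[1]),
                    k -> new ArrayList<>()).add(pair[0]);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"iniziata", "iniziare"},
            {"mangiata", "mangiare"},
            {"parlata", "parlare"},
            {"saga", "saga"}
        };
        // prints {O=[saga], R1trR0ae=[iniziata, mangiata, parlata]}
        System.out.println(groupByClass(pairs));
    }
}
```

Three different lemmas end up under a single class here, so a classifier only
has to learn one label for all of them instead of one label per lemma.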

HTH,

Rodrigo



Re: Lemmatizer BUG

Posted by Damiano Porta <da...@gmail.com>.

Hello Rodrigo!
Thank you so much! It works perfectly... but what is the reason behind the
use of the permutations? Why can we not have the lemma directly?

Thanks for the clarification
Damiano



Re: Lemmatizer BUG

Posted by Rodrigo Agerri <ra...@apache.org>.

Hello,

The String[] lemmatize(String[] toks, String[] tags) method will give you
the predicted "lemma class" for each token, which encodes the permutations
required to go from the word form to the lemma.

If the output is O, that means that no permutation is required, namely, the
lemma and the word form are considered to be the same string. The last item
in the array is for iniziata, and the class means "replace the letter t in
position 1 with r; replace the letter a in position 0 with e", resulting
in "iniziare". The word form and lemma strings are reversed for comparison,
so the positions count from the end of the word.
I am assuming that you added the asterisks...

Once you have that lemma class prediction array, you need to apply the
String[] decodeLemmas(String[] toks, String[] preds) method in the same
LemmatizerME class, which, as the javadoc states, requires the arrays of
tokens and predicted lemma classes to perform the decoding (apply the
permutations) and output the actual lemma (iniziare in your example).
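
As a rough illustration of what that decoding step does, the sketch below
applies a class such as R1trR0ae to a token by hand. It is a simplified
stand-in, not the decodeLemmas implementation: the class LemmaClassSketch and
its decode method are hypothetical, and it assumes the class is either O or a
sequence of replace operations of the form R<position><old><new>, applied to
the reversed token as described above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of OpenNLP.
class LemmaClassSketch {

    // One replace operation: 'R', a position, the old letter, the new letter.
    private static final Pattern REPLACE_OP = Pattern.compile("R(\\d+)(\\D)(\\D)");

    // Applies a replace-only lemma class to a token and returns the lemma.
    static String decode(String token, String lemmaClass) {
        if (lemmaClass.equals("O")) {
            return token; // no change: lemma equals word form
        }
        // Positions refer to the reversed token, so reverse before editing.
        StringBuilder reversed = new StringBuilder(token).reverse();
        Matcher m = REPLACE_OP.matcher(lemmaClass);
        int consumed = 0;
        while (m.find() && m.start() == consumed) {
            int pos = Integer.parseInt(m.group(1));
            char from = m.group(2).charAt(0);
            char to = m.group(3).charAt(0);
            if (pos >= reversed.length() || reversed.charAt(pos) != from) {
                throw new IllegalArgumentException(
                        "class " + lemmaClass + " does not fit token " + token);
            }
            reversed.setCharAt(pos, to);
            consumed = m.end();
        }
        // Anything left over is an operation this sketch does not model.
        if (consumed != lemmaClass.length()) {
            throw new IllegalArgumentException("unsupported class: " + lemmaClass);
        }
        return reversed.reverse().toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"ultimo", "capitolo", "della", "saga", "iniziata"};
        String[] classes = {"O", "O", "O", "O", "R1trR0ae"};
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            lemmas[i] = decode(tokens[i], classes[i]);
        }
        // prints [ultimo, capitolo, della, saga, iniziare]
        System.out.println(Arrays.toString(lemmas));
    }
}
```

In your code, the equivalent step is to pass the array returned by
lem.lemmatize(tokens, pos) to lem.decodeLemmas(tokens, ...) and print that
result instead of the raw classes.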

Cheers,

Rodrigo
