Posted to dev@opennlp.apache.org by "Zowalla, Richard" <ri...@hs-heilbronn.de> on 2022/04/11 12:47:03 UTC

Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Hi all,

we are working on training a large OpenNLP MaxEnt model for lemmatizing
German texts. We use a Wikipedia treebank from Tübingen.

This works fine for mid-sized corpora (it just needs a bit of RAM and
time). However, for large corpora we run into the exception mentioned in
[1]. Debugging into the DataOutputStream reveals that this is a
limitation of java.io.DataOutputStream itself.
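
For reference, the limitation is easy to reproduce in isolation; a
minimal sketch (plain JDK, assuming Java 11+ for String.repeat, not
OpenNLP code):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class WriteUtfLimit {

    public static void main(String[] args) throws Exception {
        DataOutputStream out =
                new DataOutputStream(new ByteArrayOutputStream());
        // writeUTF() stores the encoded length in an unsigned 16-bit
        // field, so any string whose (modified) UTF-8 encoding exceeds
        // 65535 bytes fails with:
        // java.io.UTFDataFormatException: encoded string too long
        out.writeUTF("x".repeat(70_000));
    }
}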

Is there any way to solve this, or do we need to implement custom
readers / writers to get it to work?

If this is a general problem for large corpora, I am also happy to
create a related ticket / issue in Jira with steps to reproduce ;)

Thanks in advance.

Regards
Richard

[1] 
https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long


-- 
Richard Zowalla, M.Sc.
Research Associate, PhD Student | Medical Informatics

Hochschule Heilbronn – University of Applied Sciences
Max-Planck-Str. 39 
D-74081 Heilbronn 
phone: +49 7131 504 6791 (currently not reachable by phone)
mail: richard.zowalla@hs-heilbronn.de
web: https://www.mi.hs-heilbronn.de/ 

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by Jeff Zemerick <jz...@apache.org>.
Thanks for trying it and for all the info! I will check it out and let you
know.

Thanks,
Jeff


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by "Zowalla, Richard" <ri...@hs-heilbronn.de>.
Hi Jeff,

he did the validation again, and it showed that the IDE had used an
older version of OpenNLP.

After a clean build with the freshly created SNAPSHOT, the model load
resulted in another exception (which now looks reasonable to me).

He updated his comment in [1]. Maybe you have an idea :)

Thanks
Richard

[1] 
https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by "Zowalla, Richard" <ri...@hs-heilbronn.de>.
Hi Jeff,

reading the stack trace myself now, I think an outdated snapshot
was included in this test (as it doesn't match the code).

I will report back if this is the case and Maven / Gradle / the IDE did
something weird.

Sorry & regards
Richard


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by "Zowalla, Richard" <ri...@hs-heilbronn.de>.
Hi Jeff,

the task completed and we have some feedback.

My colleague commented directly on the related commit [1].

Writing the model seems to work, but reading the resulting model fails.
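
If the new writer emits an explicit length followed by the raw UTF-8
bytes, the reader has to mirror that exactly; here is a minimal sketch
of what I would expect the read side to look like (hypothetical names,
not the actual OpenNLP code):

import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LongUtfReader {

    // Hypothetical counterpart to a "write length, then raw bytes"
    // writer. Reading an old writeUTF()-encoded model with this method
    // (or a new model with readUTF()) misinterprets the length field
    // and fails.
    static String readLongUTF(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}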

Regards
Richard

[1] 
https://github.com/apache/opennlp/commit/803f5a4f3a938b7e19ad0be6915097708348e702#commitcomment-71463963


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by "Zowalla, Richard" <ri...@hs-heilbronn.de>.
Hi Jeff,

thanks for the update.

We will give the change a try with a SNAPSHOT build that includes the
potential patch and start a run on the cluster with the Tübingen
Wikipedia treebank. I guess we will have feedback regarding
writeShort(...) in ~48 hours.

Regards
Richard 


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by Jeff Zemerick <jz...@apache.org>.
Luckily, this looks like a long-standing, common problem [1] with
writeUTF(). Per other guidance and the function's javadocs [2],
writeUTF() first writes the number of encoded bytes, in a 16-bit field,
followed by the string itself. Changing the code to write the length of
the string manually, followed by write(), allows the training to
succeed. All unit tests pass, which seems to indicate the change is
backward compatible (the tests load models from src/test/resources/),
but I want to verify that more thoroughly to be sure.

Here are the changes:
https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1

I am unsure about using the writeShort() method to write the length of
the string. Even though it works for the UD data now, does that
actually resolve the problem?
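
To make that concrete, the approach is roughly the following in shape
(just a sketch with made-up names, not the actual patch):

import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LongUtfWriter {

    // Hypothetical replacement for writeUTF(): write the length
    // ourselves, then the raw bytes via write().
    static void writeLongUTF(DataOutputStream out, String s)
            throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        // writeInt() gives a 32-bit length field. writeShort()
        // truncates the value to its low 16 bits, so it keeps a
        // 64K-style cap on the length.
        out.writeInt(bytes.length);
        out.write(bytes);
    }
}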

Anyone have any insights into this?

Thanks,
Jeff

[1]
https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
[2]
https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by Jeff Zemerick <jz...@apache.org>.
Great, thanks. I was able to reproduce the problem. I'll take a look and
keep this thread updated.

Thanks,
Jeff


Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by "Zowalla, Richard" <ri...@hs-heilbronn.de>.
Hi Jeff,

thanks for the quick reply. Here it is: 
https://issues.apache.org/jira/browse/OPENNLP-1366

Using the treebank from Tübingen might not be feasible, as it consumes
around 2 TB of RAM ;) - the link mentioned in the ticket points to a
smaller dataset, which should reproduce the issue with a feasible
amount of RAM.

It basically boils down to a size limitation in the JDK's
DataOutputStream. 

Regards
Richard

-- 
Richard Zowalla, M.Sc.
Research Associate, PhD Student | Medical Informatics

Hochschule Heilbronn – University of Applied Sciences
Max-Planck-Str. 39 
D-74081 Heilbronn 
phone: +49 7131 504 6791 (currently not reachable by phone)
mail: richard.zowalla@hs-heilbronn.de
web: https://www.mi.hs-heilbronn.de/ 

Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Posted by Jeff Zemerick <jz...@apache.org>.
Hi Richard,

Thanks for reporting this. A Jira issue with steps to reproduce it would be
fantastic. https://issues.apache.org/jira/projects/OPENNLP

Please create one and reply back here with its ID once you do. I can take a
look and see what can be done.

Thanks,
Jeff
