Posted to users@opennlp.apache.org by Ling <li...@gmail.com> on 2017/06/29 02:04:02 UTC

Is this a typical OpenNLP tokenization issue?

Hi, all:

I am testing OpenNLP and found a significant tokenization issue
involving punctuation.

Thank you Costco!
i love costco!
I love Costco!!
FUCK IKEA.

In all these cases, the final punctuation mark is not split off, so "Costco!"
and "IKEA." are each treated as one token. This looks like a systematic
problem. Before I file an issue against the OpenNLP project, I want to make
sure it really comes from the library.

Has anyone else encountered a similar problem? Thanks.

Re: Is this a typical OpenNLP tokenization issue?

Posted by Suneel Marthi <su...@gmail.com>.

Well, you could wait until the next release for newer models.


Re: Is this a typical OpenNLP tokenization issue?

Posted by Ling <li...@gmail.com>.

These are my original concerns. In Deeplearning4j, which uses OpenNLP 1.5,
"Costco!", "IKEA." and similar strings are each treated as one token. Jörn
said this is due to the old models.

Thank you Costco!
i love costco!
I love Costco!!
FUCK IKEA.


Re: Is this a typical OpenNLP tokenization issue?

Posted by Suneel Marthi <su...@gmail.com>.

On Thu, Jun 29, 2017 at 8:36 PM, Ling <li...@gmail.com> wrote:

> Hi, Suneel, that's great. The reason was that I wanted to do something in
> Deeplearning4j and happened to find that OpenNLP was already integrated
> into it. So I just used their API to call OpenNLP.
>
> Is there a set date for the next release? Also, are the 1.5 models the same
> as the models to be included in the 1.8.1 release?
>

It should be some time next week.

If by 'models being the same' you mean how they are used, then yes: nothing
changes in how you invoke the model from your code.


Re: Is this a typical OpenNLP tokenization issue?

Posted by Ling <li...@gmail.com>.

Hi, Suneel, that's great. The reason was that I wanted to do something in
Deeplearning4j and happened to find that OpenNLP was already integrated
into it. So I just used their API to call OpenNLP.

Is there a set date for the next release? Also, are the 1.5 models the same
as the models to be included in the 1.8.1 release?

Thanks.
Ling


Re: Is this a typical OpenNLP tokenization issue?

Posted by Suneel Marthi <sm...@apache.org>.

On Thu, Jun 29, 2017 at 8:07 PM, Ling <li...@gmail.com> wrote:

> Hi, Jörn:
>
> I want to use OpenNLP directly, instead of through Deeplearning4j and UIMA.
> I added OpenNLP 1.8 as a Maven dependency in my POM file; do I still need
> to download the models separately? I can't find those model files. For
> example, to do a simple test of the tokenization model,
>

DL4J is for deep learning; OpenNLP is for text processing. I'm not sure why
you would go to DL4J first and then fall back to OpenNLP if all you want to
do is basic text processing.

The model files (1.5 models) are currently at:
http://opennlp.sourceforge.net/models-1.5/



>
> InputStream is = new FileInputStream("en-token.bin");
>
> Do I have to download en-token.bin separately? I am working in a Maven
> project. Thank you


Yes, the models need to be downloaded separately.

We finally got approval from the Apache Software Foundation to distribute
OpenNLP models through Apache. Following the upcoming 1.8.1 release, we
should be distributing updated 1.8.1 models too, once we hash out the
details for doing that.
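
Because the model is not bundled with the Maven artifact, the code has to
open a file that was downloaded beforehand. A minimal sketch of that step,
using only the JDK, might look like the following. The class name
ModelFileLocator, the "models" directory and the error message are
assumptions made for illustration and are not part of OpenNLP. The returned
stream is what would then be handed to OpenNLP's TokenizerModel constructor.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of OpenNLP): opens a model file that was
// downloaded separately and fails with a clear message if it is missing.
public class ModelFileLocator {

    // Resolves modelName (e.g. "en-token.bin") against modelDir and opens it.
    public static InputStream open(Path modelDir, String modelName) throws IOException {
        Path modelPath = modelDir.resolve(modelName);
        if (!Files.isRegularFile(modelPath)) {
            throw new FileNotFoundException(
                "Model file not found: " + modelPath.toAbsolutePath()
                + " (models are not bundled with the Maven artifact; download it first)");
        }
        return Files.newInputStream(modelPath);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the model was saved into ./models by hand.
        try (InputStream in = open(Paths.get("models"), "en-token.bin")) {
            // With OpenNLP on the classpath, the stream would be passed on
            // here, e.g. new TokenizerModel(in).
            System.out.println("Model file opened");
        } catch (FileNotFoundException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Keeping the existence check separate from the model loading makes a missing
download show up as an explicit message rather than as a failure deeper in
the tokenizer setup.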



Re: Is this a typical OpenNLP tokenization issue?

Posted by Gary Underwood <gu...@clinacuity.com>.

The models are separate. They can be downloaded from http://opennlp.sourceforge.net/models-1.5/
Gary Underwood
gunderwood@clinacuity.com




Re: Is this a typical OpenNLP tokenization issue?

Posted by Ling <li...@gmail.com>.

Hi, Jörn:

I want to use OpenNLP directly, instead of through Deeplearning4j and UIMA.
I added OpenNLP 1.8 as a Maven dependency in my POM file; do I still need
to download the models separately? I can't find those model files. For
example, to do a simple test of the tokenization model,

InputStream is = new FileInputStream("en-token.bin");

Do I have to download en-token.bin separately? I am working in a Maven
project. Thank you.

Ling



Re: Is this a typical OpenNLP tokenization issue?

Posted by Joern Kottmann <ko...@gmail.com>.

That is a long chain. Yes, in that case you are probably using the
SourceForge tokenization model, which was trained on old news text.

We usually don't consider mistakes the models make to be bugs, because we
can't do much about them other than suggest using models that fit your
data well, and even then models can sometimes be wrong.

If there is something we can do here to reduce the error rate, we would be
very happy to receive it as a contribution, or just to have it pointed out.

Jörn
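
One possible user-side workaround, short of training a model on matching
data, is a small post-processing pass that splits sentence-final punctuation
off the tokens a tokenizer returns. The following is only a sketch of that
idea and is not something OpenNLP provides: the class
TrailingPunctuationSplitter and the choice of '.', '!' and '?' as the
characters to split are assumptions. A rule like this will also split
abbreviations such as "U.S.", so it only suits text like the examples in
this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step (not part of OpenNLP): splits a run of
// trailing '.', '!' or '?' characters off each token.
public class TrailingPunctuationSplitter {

    private static boolean isSplitChar(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    public static List<String> split(String[] tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            // Find where the trailing punctuation run starts.
            int end = token.length();
            while (end > 0 && isSplitChar(token.charAt(end - 1))) {
                end--;
            }
            // Keep the word part, if any, as its own token.
            if (end > 0) {
                result.add(token.substring(0, end));
            }
            // Emit each trailing punctuation character as its own token.
            for (int i = end; i < token.length(); i++) {
                result.add(String.valueOf(token.charAt(i)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens as described in the thread: punctuation glued to the last word.
        String[] tokens = {"I", "love", "Costco!!"};
        System.out.println(split(tokens)); // prints [I, love, Costco, !, !]
    }
}
```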


Re: Is this a typical OpenNLP tokenization issue?

Posted by Ling <li...@gmail.com>.

Hi, Jörn:

I am using Deeplearning4j, which I think uses the org.apache.uima library,
and UIMA in turn uses OpenNLP. That is probably what happens.

So the problem does not originate in OpenNLP itself? Thank you.

Ling


Re: Is this a typical OpenNLP tokenization issue?

Posted by Joern Kottmann <ko...@gmail.com>.

Hello,

Which model are you using? Did you train it yourself?

Jörn
