Posted to users@opennlp.apache.org by andrea maestroni <an...@gmail.com> on 2012/08/23 15:15:52 UTC

Document Categorizer

Hi all!

I am trying to develop a Java program that takes a document, extracts its text, analyzes it, and determines the main topic of the document.

I think this is a document categorization problem, right?

I tried the example from the manual page.

I created the training file, an RTF file with these lines:

GMDecrease Major acquisitions that have a lower gross margin than the existing network also \ 
           had a negative impact on the overall gross margin, but it should improve following \ 
           the implementation of its integration strategies .
GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \
           to obligations towards dealers .
Then in my code I use this function to train a model:

public static void Train() throws InvalidFormatException, IOException {
        
        DoccatModel model = null;

        InputStream dataIn = null;
        try {
            dataIn = new FileInputStream("/Users/andry85mae/Desktop/apache-opennlp-1.5.2-incubating/bin/train.train");
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

            model = DocumentCategorizerME.train("en", sampleStream);
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        } finally {
            if (dataIn != null) {
                try {
                    dataIn.close();
                } catch (IOException e) {
                    // Not an issue, training already finished.
                    // The exception should be logged and investigated
                    // if part of a production system.
                    e.printStackTrace();
                }
            }
        }
      
    }

But it gives me an error:

java.io.IOException: Empty lines, or lines with only a category string are not allowed!
	Computing event counts...  Incorporating indexed data for training...  
Exception in thread "main" java.lang.NullPointerException
	at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
	at opennlp.maxent.GIS.trainModel(GIS.java:256)
	at opennlp.model.TrainUtil.train(TrainUtil.java:182)
	at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:154)
	at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:176)
	at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:207)
	at opennlp_prova.Opennlp_prova.Train(Opennlp_prova.java:55)
	at opennlp_prova.Opennlp_prova.main(Opennlp_prova.java:96)
Java Result: 1

What is the error?

Thanks in advance!


Re: Document Categorizer

Posted by Jörn Kottmann <ko...@gmail.com>.
The actual issue he encounters is this:
https://issues.apache.org/jira/browse/OPENNLP-122

Many people encounter it when they try the doccat component,
and therefore there is a JIRA issue for that too:
https://issues.apache.org/jira/browse/OPENNLP-488

We should probably change the documentation and provide
sample instructions which actually work, rather than ones that are only
there to illustrate how it could work with more data.

Jörn

On 08/24/2012 02:01 PM, Lance Norskog wrote:
> Please file a JIRA.


Re: Document Categorizer

Posted by Lance Norskog <go...@gmail.com>.
Please file a JIRA.

On Thu, Aug 23, 2012 at 8:12 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> Just continue to add more data and also add a couple of samples per
> category.
> The exception you see is well known and indicates that the trainer does not
> see
> enough training samples.
>
> We will fix the exception with the next bigger release and print an error
> message
> instead.
>
> Jörn



-- 
Lance Norskog
goksron@gmail.com

Re: Document Categorizer

Posted by Jörn Kottmann <ko...@gmail.com>.
Just continue to add more data, and also add a couple of samples per category.
The exception you see is well known and indicates that the trainer does not see
enough training samples.
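
One way to see whether every category has more than a single sample is to count the documents per category before training. The following is a minimal plain-Java sketch, not OpenNLP code: `CategoryCounts` and `countPerCategory` are hypothetical names, and it assumes one document per line with the category as the first whitespace-separated token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CategoryCounts {

    /** Counts how many training lines start with each category label. */
    static Map<String, Integer> countPerCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines carry no sample
            }
            // The first whitespace-separated token is taken as the category.
            String category = trimmed.split("\\s+", 2)[0];
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the lines of a training file.
        List<String> lines = Arrays.asList(
            "GMDecrease Major acquisitions had a negative impact on the overall gross margin .",
            "GMIncrease The upward movement of gross margin resulted from adjustments .",
            "GMIncrease Gross margin rose after the integration was completed .");
        for (Map.Entry<String, Integer> e : countPerCategory(lines).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " sample(s)");
        }
    }
}
```

Categories that show up with only one or two samples are the ones to extend first.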

We will fix the exception in the next major release and print an error message
instead.

Jörn

On 08/23/2012 04:51 PM, andrea maestroni wrote:
> In total there are 10 categories...
>
> but I still get the same error...


Re: Document Categorizer

Posted by andrea maestroni <an...@gmail.com>.
I tried adding some lines to the training file, like this:

GMDecrease Major acquisitions that have a lower gross margin than the existing network also \ 
           had a negative impact on the overall gross margin, but it should improve following \ 
           the implementation of its integration strategies .
GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \
           to obligations towards dealers .
Caquetoire A caquetoire is an armchair with simply turned legs \
		 It has curved arms, but the shape of the seat is what really distinguishes it \
		 It was designed to be very wide in the front, and narrowed at the back, making a triangular shape \
		 The back was high and panelled, and sometimes was decorated with carving and medallions.
Furniture If you want to buy outdoor furniture, you must give it the same consideration as buying indoor furniture \
		After all, your objective is the same: you want a comfortable and attractive space where you can relax or entertain \
Style Figure out which style should dominate \
	 It can be a modern space with antique accents or a traditional space with contemporary accents \
	 Letting one style dominate is crucial \
	 because you don’t want to create a space where everything is fighting for equal attention.
..........
..........
..........
Wood	All wood construction simply means that all parts are made of wood \
	However, the piece of furniture may include some combination of solid wood and engineered wood \
	An artificially laminated surface consists of plastic, foil or paper that is printed with a wood grain pattern \
	This is then bonded to a composite such as particleboard or medium density fiberboard \
	Engineered wood \
	There are two kinds of engineered wood: plywood and particleboard, which is also called fiberboard.

Home_Safety So many furniture related accidents happen inside the home, that it makes sense to take a look at these indoor safety tips \
		  A lot of indoor accidents involve children, so make sure to go over your safety rules with them as well.

In total there are 10 categories...

but I still get the same error...

On 23 Aug 2012, at 16:18, Jörn Kottmann wrote:

> On 08/23/2012 03:30 PM, andrea maestroni wrote:
>> so i must add some line to the train file? or adding other file?
>> there are some example for the file and for the classification?
> 
> The problem here is that the default training does a feature
> cutoff of 5. So a feature must be seen at least 5 times to be included
> in the training. With just two training samples you do not get to 5.
> It should not crash if you set the cutoff to 0.
> 
> But in the end the model will not really be able to predict anything with
> just two training samples. Usually you want to train with at least a few hundred
> or thousands of samples.
> 
> You need to add more lines to the training file. Each line is one document, starting
> with the category, just like in the sample you experimented with.
> 
> Jörn


Re: Document Categorizer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 08/23/2012 03:30 PM, andrea maestroni wrote:
> So must I add more lines to the training file, or add another file?
> Are there any examples of the file and of the classification?

The problem here is that the default training does a feature
cutoff of 5. So a feature must be seen at least 5 times to be included
in the training. With just two training samples you do not get to 5.
It should not crash if you set the cutoff to 0.
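
To make the cutoff concrete, here is a minimal plain-Java sketch. It does not use the OpenNLP API and is not how the trainer is implemented: `FeatureCutoffSketch` and `survivingFeatures` are hypothetical names, and whitespace-separated words stand in for features. It counts how often each word occurs across all samples and keeps only the words seen at least `cutoff` times.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FeatureCutoffSketch {

    /** Returns the words that occur at least {@code cutoff} times across all samples. */
    static Set<String> survivingFeatures(List<String> sampleTexts, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : sampleTexts) {
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The two documents from the original post, each joined onto one line.
        List<String> samples = Arrays.asList(
            "Major acquisitions that have a lower gross margin than the existing network also "
                + "had a negative impact on the overall gross margin, but it should improve "
                + "following the implementation of its integration strategies .",
            "The upward movement of gross margin resulted from amounts pursuant to adjustments "
                + "to obligations towards dealers .");

        // No word reaches 5 occurrences in these two samples, so nothing is kept.
        System.out.println("cutoff 5 keeps: " + survivingFeatures(samples, 5));
        // With a cutoff of 0 every word is kept.
        System.out.println("cutoff 0 keeps " + survivingFeatures(samples, 0).size() + " words");
    }
}
```

As far as I recall, OpenNLP 1.5.x has a `DocumentCategorizerME.train(languageCode, samples, cutoff, iterations)` overload through which the cutoff can be set; treat that signature as an assumption and check the Javadoc of your version.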

But in the end the model will not really be able to predict anything with
just two training samples. Usually you want to train with at least a few
hundred or thousands of samples.

You need to add more lines to the training file. Each line is one document,
starting with the category, just like in the sample you experimented with.
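
The line format described above can be checked before training. The following is a minimal plain-Java sketch, not OpenNLP code: `TrainingLineCheck` and `parseLine` are hypothetical names. It splits a line on whitespace, takes the first token as the category and the rest as the document text, and rejects empty lines and lines that contain only a category, which is the condition named by the IOException in the original post.

```java
import java.util.Arrays;

final class TrainingLineCheck {

    /** Parses "category word word ..." and returns {category, text}. */
    static String[] parseLine(String line) {
        String trimmed = line.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                "Empty line or category-only line: \"" + line + "\"");
        }
        String category = tokens[0];
        String text = String.join(" ", Arrays.copyOfRange(tokens, 1, tokens.length));
        return new String[] { category, text };
    }

    public static void main(String[] args) {
        String[] sample = parseLine(
            "GMIncrease The upward movement of gross margin resulted from amounts "
                + "pursuant to adjustments to obligations towards dealers .");
        System.out.println("category: " + sample[0]);
        System.out.println("text:     " + sample[1]);

        try {
            parseLine(".........."); // a single token, so there is no document text
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this rule a wrapped continuation line would be read as a separate document whose first word is taken as the category, so each document should sit on one physical line.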

Jörn

Re: Document Categorizer

Posted by andrea maestroni <an...@gmail.com>.
Thanks!

So must I add more lines to the training file, or add another file?
Are there any examples of the file and of the classification?

Sorry, I am new to OpenNLP :)

On 23 Aug 2012, at 15:21, Jörn Kottmann wrote:

> The error is thrown because you do not have enough training samples,
> try to run your code with at least 10 to 20 training samples.
> 
> Jörn


Re: Document Categorizer

Posted by Jörn Kottmann <ko...@gmail.com>.
The error is thrown because you do not have enough training samples.
Try running your code with at least 10 to 20 training samples.

Jörn
