Posted to users@opennlp.apache.org by Daniel Russ <da...@netscape.net> on 2015/11/06 17:22:32 UTC

requirements information extraction

Hello,

I am trying to extract requirements from a large set of documents (sorry, I can’t be more specific). The documents are split into sentences.

I have a small training set that has been annotated in a format similar to the one the NameFinder uses:

I think that you should <START:requirement> eat more chicken <END> .  
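
For readers who want to handle this annotation format without the CLI, here is a minimal Java sketch that turns one annotated line into tokens plus typed spans. It assumes whitespace-separated tokens and non-nested tags; `AnnotatedSentence` and `parse` are hypothetical names for illustration, not OpenNLP classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): one sentence with typed spans. */
class AnnotatedSentence {
    final List<String> tokens = new ArrayList<>();
    /** Each entry is {startTokenIndex, endTokenIndexExclusive}. */
    final List<int[]> spans = new ArrayList<>();
    /** Span type at the same position as the entry in spans. */
    final List<String> types = new ArrayList<>();

    /** Parses a line such as "a <START:requirement> b c <END> ." */
    static AnnotatedSentence parse(String line) {
        AnnotatedSentence s = new AnnotatedSentence();
        int openStart = -1;   // token index where the open span began, -1 if none
        String openType = null;
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;     // happens only for an empty input line
            }
            if (part.startsWith("<START:") && part.endsWith(">")) {
                if (openStart >= 0) {
                    throw new IllegalArgumentException("nested <START> tag");
                }
                openStart = s.tokens.size();
                openType = part.substring("<START:".length(), part.length() - 1);
            } else if (part.equals("<END>")) {
                if (openStart < 0) {
                    throw new IllegalArgumentException("<END> without <START>");
                }
                s.spans.add(new int[] {openStart, s.tokens.size()});
                s.types.add(openType);
                openStart = -1;
            } else {
                s.tokens.add(part);   // ordinary token
            }
        }
        if (openStart >= 0) {
            throw new IllegalArgumentException("unclosed <START> tag");
        }
        return s;
    }
}
```

Keeping spans as token offsets with an exclusive end makes it straightforward to map them onto whatever sample class the training code expects.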

I can get fair results with the NameFinder, which is actually surprising considering I just used the opennlp CLI application with my data.  I would like to customize the features for my needs.

I have spent some time looking into the NameFinder code.  It appears that a lot of the code is required to integrate with the opennlp CLI application.  That is not a requirement for me.  

It appears that the minimum I need to create is an equivalent to a NameSample; a cachedFeatureGenerator containing a set of adaptiveFeatureGenerators, including my custom featureGenerator (extending FeatureGeneratorAdapter); an objectStream of Samples; and an eventStream (extending AbstractEventStream).

A few questions:
1) Did I get all the parts needed?  The xxxME (NameFinderME) appears to wrap all the training functionality for the application, but does not seem to be a requirement.  All the various factories appear to be required only to work with the opennlp CLI application.
2) What exactly is adaptive about the adaptive data?  The contextGenerators add all the predicates to the context at each index.  Do I clear the adaptiveData at the end of each sentence?
3) The NameFinder does not appear to use the BeamSearch; by default it creates a GIS object and trains using that.  I think that the beam search would be better for me, because it keeps multiple candidate local outcomes to improve the global classification.  Am I correct?

BTW: The example online at https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen uses a deprecated method. The only non-deprecated NameFinderME.train method uses a TokenNameFinderFactory, which doesn’t have a method to set the context generators (except via XML).

Thank you for any advice
Dan

Re: requirements information extraction

Posted by Rodrigo Agerri <ra...@apache.org>.
Hello Daniel,

The NameFinderME.train(String lang, String type,
ObjectStream<NameSample>, TrainingParameters, TokenNameFinderFactory)
method requires a language, a type, the training samples, a
TrainingParameters object and a TokenNameFinderFactory.

If you do not specify a type, it will train for every type found in the data.

To create a TokenNameFinderFactory, use
TokenNameFinderFactory.create(String subclassName, byte[]
featureGeneratorBytes, final Map<String, Object> resources,
SequenceCodec<String> seqCodec).

If you would like to use your own feature set, you can pass it in as a
byte[] array, which can be constructed from the String representation
of an XML feature generator descriptor.
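
As a rough illustration of that step, the Java sketch below builds a small XML descriptor as a String, checks that it is well-formed, and converts it to a byte[]. The element names (`generators`, `cache`, `window`, `token`, `custom` with a `class` attribute) follow the feature generation section of the OpenNLP manual as far as I recall it, so please verify them against your version. `FeatureDescriptor` and the class name passed in are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

/** Hypothetical helper: builds feature generator descriptor bytes. */
class FeatureDescriptor {

    /** Returns a descriptor that also loads customClass (an assumed class name). */
    static String xml(String customClass) {
        return String.join("\n",
            "<generators>",
            "  <cache>",
            "    <generators>",
            "      <window prevLength=\"2\" nextLength=\"2\">",
            "        <token/>",
            "      </window>",
            "      <custom class=\"" + customClass + "\"/>",
            "    </generators>",
            "  </cache>",
            "</generators>");
    }

    /** Encodes the descriptor as UTF-8 and rejects malformed XML early. */
    static byte[] toBytes(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try {
            // Parsing only verifies well-formedness; the result is discarded.
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("descriptor is not well-formed XML", e);
        }
        return bytes;
    }
}
```

The returned array would be the featureGeneratorBytes argument of the create method named above, for example `FeatureDescriptor.toBytes(FeatureDescriptor.xml("com.example.RequirementFeatureGenerator"))`.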

The beam size is read from the TrainingParameters; if it is not set
there, it defaults to 3.
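
To make the role of the beam size concrete, here is a generic beam search decoder in Java. It is a stand-alone sketch, not OpenNLP's BeamSearch class; `BeamSearchSketch`, `Scorer` and `decode` are hypothetical names, and the scorer is assumed to return a log-probability for an outcome given only the previous outcome.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Generic beam search over outcome sequences (illustration only). */
class BeamSearchSketch {

    /** Log-probability of outcome at position, given prevOutcome (-1 at position 0). */
    interface Scorer {
        double logProb(int position, int prevOutcome, int outcome);
    }

    /** A partial outcome sequence together with its accumulated log score. */
    static final class Hypothesis {
        final int[] outcomes;
        final double score;

        Hypothesis(int[] outcomes, double score) {
            this.outcomes = outcomes;
            this.score = score;
        }
    }

    /** Returns the best-scoring sequence that survives pruning to beamSize. */
    static int[] decode(int length, int numOutcomes, int beamSize, Scorer scorer) {
        List<Hypothesis> beam = new ArrayList<>();
        beam.add(new Hypothesis(new int[0], 0.0));
        for (int i = 0; i < length; i++) {
            List<Hypothesis> next = new ArrayList<>();
            for (Hypothesis h : beam) {
                int prev = (i == 0) ? -1 : h.outcomes[i - 1];
                for (int o = 0; o < numOutcomes; o++) {
                    // Extend the hypothesis by one outcome.
                    int[] extended = Arrays.copyOf(h.outcomes, i + 1);
                    extended[i] = o;
                    next.add(new Hypothesis(extended, h.score + scorer.logProb(i, prev, o)));
                }
            }
            // Keep only the beamSize highest-scoring hypotheses.
            next.sort(Comparator.comparingDouble((Hypothesis h) -> h.score).reversed());
            beam = new ArrayList<>(next.subList(0, Math.min(beamSize, next.size())));
        }
        return beam.get(0).outcomes;
    }
}
```

With a beam size of 1 this reduces to greedy decoding. A wider beam lets a locally weaker first choice win when later positions favour it, at the cost of scoring more candidates per token.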

In the NameSampleDataStream class you can see that the features are
cleared each time an empty line is encountered.
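
The idea of adaptive data can be sketched as follows. The method names mirror those of OpenNLP's AdaptiveFeatureGenerator interface, but the class deliberately does not implement it so that it compiles without the library. `PreviousOutcomeFeatures` is a hypothetical name, and the feature prefixes are chosen only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of a feature generator that keeps adaptive state. */
class PreviousOutcomeFeatures {

    /** Adaptive data: the last outcome assigned to each token so far. */
    private final Map<String, String> previousOutcome = new HashMap<>();

    /** Adds features for tokens[index]; previousOutcomes is unused in this sketch. */
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes) {
        features.add("w=" + tokens[index]);
        // "pd=null" is emitted when the token has not been seen since the last clear.
        features.add("pd=" + previousOutcome.get(tokens[index]));
    }

    /** Called after a sentence has been processed, to remember its outcomes. */
    void updateAdaptiveData(String[] tokens, String[] outcomes) {
        for (int i = 0; i < tokens.length; i++) {
            previousOutcome.put(tokens[i], outcomes[i]);
        }
    }

    /** Called at a document boundary, so state does not leak across documents. */
    void clearAdaptiveData() {
        previousOutcome.clear();
    }
}
```

In this sketch the state survives from one sentence to the next and is reset only when clearAdaptiveData is called, which corresponds to the empty-line boundary described above.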

HTH,

R
