Posted to users@opennlp.apache.org by Paul Cowan <da...@scotalt.net> on 2011/01/05 21:32:54 UTC

Re: Training multiple models

Hi,

I have created a sample file which for now is marked up only with
<START:Organization>...<END> markers, and I have the following test, which is
passing.  Java is not a language I have spent an awful lot of time on, so
forgive any ignorance on my part:

    @Test
    public void testHtmlOrganizationFind() throws Exception {
        InputStream in = getClass().getClassLoader().getResourceAsStream(
                "opennlp/tools/namefind/htmlbasic.train");

        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(in))
        );

        TokenNameFinderModel nameFinderModel = NameFinderME.train("en",
                "organization", sampleStream,
                Collections.<String, Object>emptyMap(), 70, 1);

        assertNotNull(nameFinderModel);
    }

At the moment, I am preprocessing the htmlbasic.train file by stripping out
all the new line characters so that it is just one line.
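For what it is worth, that preprocessing step is easy to script; here is a minimal, self-contained sketch (the class name is my own, not part of OpenNLP), assuming you just want every line break collapsed to a space:

```java
import java.util.regex.Pattern;

public class CollapseNewlines {
    // Collapse every run of line breaks (plus surrounding spaces) into a
    // single space, so a whole HTML document fits on one line of the
    // one-sentence-per-line training format.
    public static String toOneLine(String html) {
        return Pattern.compile("\\s*\\R\\s*").matcher(html.trim()).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(toOneLine("<html>\n<body>hi</body>\n</html>"));
    }
}
```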

I would be grateful if anyone could help me with the following questions:

1.  Is the "type" argument passed into the NameFinderME.train method the type
of the model, which in my case is organization (<START:organization>)?  If so,
would I need to call train for each tag I mark the text up with?  I want to
use <START:location> and others, for example.

2.  How do I feed multiple files into the training?  Somebody said I could
use the <HTML> tags as document delimiters.  Or is another way to merge all
the documents into one file, delimited by new line characters?  I cannot
find a test which shows how to do this.
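On question 2: later in the thread a blank line between documents is confirmed as the delimiter, so one simple option is to concatenate the files yourself with a blank line between them. A stdlib-only sketch (hypothetical class name, not an OpenNLP API):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MergeTrainingFiles {
    // Join several single-line training documents with a blank line
    // between them; the blank line marks a document boundary for the
    // name finder's adaptive feature generators.
    public static String merge(List<String> documents) {
        return String.join("\n\n", documents) + "\n";
    }

    public static void main(String[] args) throws IOException {
        List<String> docs = new ArrayList<>();
        for (String arg : args) {                       // one path per training file
            docs.add(Files.readString(Path.of(arg), StandardCharsets.UTF_8).trim());
        }
        Files.writeString(Path.of("merged.train"), merge(docs), StandardCharsets.UTF_8);
    }
}
```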

Thanks

Paul

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 23 December 2010 16:42, Benson Margulies <bi...@gmail.com> wrote:

> If I were you, I'd keep HTML digestion separate from sentence bounding.
>
>
> On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <da...@scotalt.net> wrote:
> > Hi,
> >
> > Am I right in saying that I will also need to create and train my own
> HTML
> > sentence detector in order to parse the HTML into chunks that can be
> > tokenised?
> >
> > Cheers
> >
> > Paul Cowan
> >
> > Cutting-Edge Solutions (Scotland)
> >
> > http://thesoftwaresimpleton.blogspot.com/
> >
> >
> >
> > On 17 December 2010 15:10, Jörn Kottmann <ko...@gmail.com> wrote:
> >
> >> On 12/17/10 2:19 PM, James Kosin wrote:
> >>
> >>> I have the following questions that I would appreciate an answer for:
> >>> >
> >>> >  1. Can I have the different name finding tags in the same data?
> >>>
> >>
> >> Yes, but that means you train a model which can detect each of these
> >> names. You should test both, multiple name types in one model,
> >> and separate models for each name type. You can use the built
> >> in evaluation to validate your results.
> >>
> >>  >  2. Does the <START:address> <END> make sense over multiple lines or
> >>> should I
> >>> >  break this up further?
> >>>
> >> No, that is not possible; names spanning multiple sentences (a line is a
> >> sentence) are not supported.
> >>
> >>
> >>  >  3. I want to use 200 or 300 different examples, do I need to create
> >>> separate
> >>> >  files for each example or can I merge them all into 1 and if it is
> only
> >>> 1,
> >>> >  do I need to mark up the start and end of a file?
> >>>
> >> If you want to use the command line training tool they must all be in
> >> one file; if you use the API
> >> it's up to you to merge these different sources into one name sample
> >> stream.
> >>
> >> Jörn
> >>
> >
>

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
I meant that I have written an HtmlTokenizer and not an HTML parser.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 31 January 2011 14:52, Paul Cowan <da...@scotalt.net> wrote:

> I have written an html parser which I am using to tokenize an html document
> (new line characters removed) and pass into the find method of NameFinderME.
>
> I am getting good results for some basic model testing on identically
> trained html and sample html (without the <START:organization>...<END> tags
> and different company names).
>
> When it comes to training the model, I am calling the static train method
> of NameFinderME.
>
> I have noticed that the tokenization of the training data happens in the
> read method of NameSampleDataStream which in turn calls the static parse
> method of NameSample.
>
> This method uses the WhitespaceTokenizer to tokenize.
>
> Am I right in saying that I should be using the same tokenizer for both
> training and finding?
>
> Should I write something to take care of the NameSample object creation
> that uses my HtmlTokenizer, or maybe it makes sense to extend
> NameSampleDataStream to allow for the use of other tokenizers?
>
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 26 January 2011 04:33, Khurram <kh...@gmail.com> wrote:
>
>> I am trying to find out what is the correlation between the amount of
>> training data and the accuracy of find calls. In other words, at what point
>> adding more training data starts to matter less and less and we run into
>> diminishing returns...
>>
>> One more thing: it would be nice to see something like a Statistic object
>> populated after finder.train to see how well you have trained the model.
>>
>> thanks,
>>
>> On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <ko...@gmail.com>
>> wrote:
>>
>> > On 1/25/11 3:22 PM, Paul Cowan wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks for your comments on the JIRA.
>> >>
>> >> Should I be expecting exact results if the training data and the sample
>> >> data
>> >> are exactly the same or is there just too little training data to tell
>> at
>> >> this stage?
>> >>
>> >>
>> > If you are training with a cutoff of 5 then the results might not be
>> > identical, and even if they are, you want good results on "unknown" data.
>> >
>> > That is why you need a certain amount of training data to get the model
>> > going.
>> >
>> > When we have natural language text we divide it into sentences to extract
>> > a unit we can pass on to the name finder. For me it seems that it is more
>> > difficult to get such a unit when working directly on html data. In your
>> > case I think the previous map feature does not really help. So you could
>> > pass a bigger chunk to the find method than you usually would do.
>> >
>> > Maybe even an entire page you crawl at a time. But then you need to have
>> > a good way of tokenizing this page, because your tokenization should take
>> > the html into account; having an html element as a token would make sense
>> > in my eyes. But you could also try to just use the simple tokenizer and
>> > play a little with the feature generation, e.g. increasing the window
>> > size to 5 or even more.
>> >
>> > After you have this you still need to annotate training data, which might
>> > not be that nice with our "text" format, because it would mean that you
>> > have to place an entire page into one line.
>> >
>> > But it should not be hard to come up with a new format; then you write a
>> > small parser and create the NameSample object yourself.
>> >
>> > Hope that helps,
>> > Jörn
>> >
>> >
>>
>
>

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
I have written an html parser which I am using to tokenize an html document
(new line characters removed) and pass into the find method of NameFinderME.

I am getting good results for some basic model testing on identically
trained html and sample html (without the <START:organization>...<END> tags
and different company names).

When it comes to training the model, I am calling the static train method of
NameFinderME.

I have noticed that the tokenization of the training data happens in the
read method of NameSampleDataStream which in turn calls the static parse
method of NameSample.

This method uses the WhitespaceTokenizer to tokenize.

Am I right in saying that I should be using the same tokenizer for both
training and finding?

Should I write something to take care of the NameSample object creation that
uses my HtmlTokenizer, or maybe it makes sense to extend
NameSampleDataStream to allow for the use of other tokenizers?
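To illustrate why the tokenizers should match, here is a toy tag-aware tokenizer in plain Java (hypothetical, not OpenNLP's tokenizer API): a whitespace tokenizer would glue tags to adjacent words, producing tokens at find time that never occurred at training time:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTokenizerSketch {
    // Emit each HTML tag as its own token and split the remaining text on
    // whitespace, so "<h2>Acme" never becomes a single token.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^\\s<]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace tokenization of the same input would instead yield
        // "<h2>Acme" and "Ltd</h2>".
        System.out.println(tokenize("<h2>Acme Ltd</h2>"));
    }
}
```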

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 26 January 2011 04:33, Khurram <kh...@gmail.com> wrote:

> I am trying to find out what is the correlation between the amount of
> training data and the accuracy of find calls. In other words, at what point
> adding more training data starts to matter less and less and we run into
> diminishing returns...
>
> One more thing: it would be nice to see something like a Statistic object
> populated after finder.train to see how well you have trained the model.
>
> thanks,
>
> On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>
> > On 1/25/11 3:22 PM, Paul Cowan wrote:
> >
> >> Hi,
> >>
> >> Thanks for your comments on the JIRA.
> >>
> >> Should I be expecting exact results if the training data and the sample
> >> data
> >> are exactly the same or is there just too little training data to tell
> at
> >> this stage?
> >>
> >>
> > If you are training with a cutoff of 5 then the results might not be
> > identical, and even if they are, you want good results on "unknown" data.
> >
> > That is why you need a certain amount of training data to get the model
> > going.
> >
> > When we have natural language text we divide it into sentences to extract
> > a unit we can pass on to the name finder. For me it seems that it is more
> > difficult to get such a unit when working directly on html data. In your
> > case I think the previous map feature does not really help. So you could
> > pass a bigger chunk to the find method than you usually would do.
> >
> > Maybe even an entire page you crawl at a time. But then you need to have
> > a good way of tokenizing this page, because your tokenization should take
> > the html into account; having an html element as a token would make sense
> > in my eyes. But you could also try to just use the simple tokenizer and
> > play a little with the feature generation, e.g. increasing the window
> > size to 5 or even more.
> >
> > After you have this you still need to annotate training data, which might
> > not be that nice with our "text" format, because it would mean that you
> > have to place an entire page into one line.
> >
> > But it should not be hard to come up with a new format; then you write a
> > small parser and create the NameSample object yourself.
> >
> > Hope that helps,
> > Jörn
> >
> >
>

Re: Training multiple models

Posted by Khurram <kh...@gmail.com>.
I am trying to find out what the correlation is between the amount of
training data and the accuracy of find calls. In other words, at what point
does adding more training data start to matter less and less and we run into
diminishing returns...

One more thing: it would be nice to see something like a Statistic object
populated after finder.train to see how well you have trained the model.

thanks,
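Regarding a post-training statistic: the built-in evaluation mentioned earlier in the thread reports precision/recall-style scores against held-out data. A minimal sketch of the underlying F1 arithmetic (illustrative only, not the OpenNLP evaluator API):

```java
public class EvalSketch {
    // F1 score from true-positive, false-positive and false-negative
    // counts, the kind of summary a name-finder evaluation reports.
    public static double f1(int tp, int fp, int fn) {
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // 8 names found correctly, 2 spurious, 2 missed.
        System.out.println(f1(8, 2, 2));
    }
}
```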

On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 1/25/11 3:22 PM, Paul Cowan wrote:
>
>> Hi,
>>
>> Thanks for your comments on the JIRA.
>>
>> Should I be expecting exact results if the training data and the sample
>> data
>> are exactly the same or is there just too little training data to tell at
>> this stage?
>>
>>
> If you are training with a cutoff of 5 then the results might not be
> identical, and even if they are, you want good results on "unknown" data.
>
> That is why you need a certain amount of training data to get the model
> going.
>
> When we have natural language text we divide it into sentences to extract
> a unit we can pass on to the name finder. For me it seems that it is more
> difficult to get such a unit when working directly on html data. In your
> case I think the previous map feature does not really help. So you could
> pass a bigger chunk to the find method than you usually would do.
>
> Maybe even an entire page you crawl at a time. But then you need to have
> a good way of tokenizing this page, because your tokenization should take
> the html into account; having an html element as a token would make sense
> in my eyes. But you could also try to just use the simple tokenizer and
> play a little with the feature generation, e.g. increasing the window
> size to 5 or even more.
>
> After you have this you still need to annotate training data, which might
> not be that nice with our "text" format, because it would mean that you
> have to place an entire page into one line.
>
> But it should not be hard to come up with a new format; then you write a
> small parser and create the NameSample object yourself.
>
> Hope that helps,
> Jörn
>
>

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
That is a big help.

I will try a few of these ideas out and see how I get on.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 25 January 2011 14:41, Jörn Kottmann <ko...@gmail.com> wrote:

> On 1/25/11 3:22 PM, Paul Cowan wrote:
>
>> Hi,
>>
>> Thanks for your comments on the JIRA.
>>
>> Should I be expecting exact results if the training data and the sample
>> data
>> are exactly the same or is there just too little training data to tell at
>> this stage?
>>
>>
> If you are training with a cutoff of 5 then the results might not be
> identical, and even if they are, you want good results on "unknown" data.
>
> That is why you need a certain amount of training data to get the model
> going.
>
> When we have natural language text we divide it into sentences to extract
> a unit we can pass on to the name finder. For me it seems that it is more
> difficult to get such a unit when working directly on html data. In your
> case I think the previous map feature does not really help. So you could
> pass a bigger chunk to the find method than you usually would do.
>
> Maybe even an entire page you crawl at a time. But then you need to have
> a good way of tokenizing this page, because your tokenization should take
> the html into account; having an html element as a token would make sense
> in my eyes. But you could also try to just use the simple tokenizer and
> play a little with the feature generation, e.g. increasing the window
> size to 5 or even more.
>
> After you have this you still need to annotate training data, which might
> not be that nice with our "text" format, because it would mean that you
> have to place an entire page into one line.
>
> But it should not be hard to come up with a new format; then you write a
> small parser and create the NameSample object yourself.
>
> Hope that helps,
> Jörn
>
>

Re: Training multiple models

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/25/11 3:22 PM, Paul Cowan wrote:
> Hi,
>
> Thanks for your comments on the JIRA.
>
> Should I be expecting exact results if the training data and the sample data
> are exactly the same or is there just too little training data to tell at
> this stage?
>

If you are training with a cutoff of 5 then the results might not be
identical, and even if they are, you want good results on "unknown" data.

That is why you need a certain amount of training data to get the model
going.

When we have natural language text we divide it into sentences to extract
a unit we can pass on to the name finder. For me it seems that it is more
difficult to get such a unit when working directly on html data. In your
case I think the previous map feature does not really help. So you could
pass a bigger chunk to the find method than you usually would do.

Maybe even an entire page you crawl at a time. But then you need to have
a good way of tokenizing this page, because your tokenization should take
the html into account; having an html element as a token would make sense
in my eyes. But you could also try to just use the simple tokenizer and
play a little with the feature generation, e.g. increasing the window
size to 5 or even more.

After you have this you still need to annotate training data, which might
not be that nice with our "text" format, because it would mean that you
have to place an entire page into one line.

But it should not be hard to come up with a new format; then you write a
small parser and create the NameSample object yourself.

Hope that helps,
Jörn
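To make the cutoff remark above concrete: a cutoff of 5 means contextual features observed fewer than five times are discarded before the model is estimated, which is why results can differ even on the training data itself. A stdlib-only illustration of the idea (not OpenNLP's implementation):

```java
import java.util.Map;
import java.util.stream.Collectors;

public class CutoffSketch {
    // Keep only features whose training-data frequency reaches the
    // cutoff; rare features never make it into the model.
    public static Map<String, Integer> applyCutoff(Map<String, Integer> counts, int cutoff) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("w=acme", 7, "w=ltd", 2);
        System.out.println(applyCutoff(counts, 5).keySet());
    }
}
```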


Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
Hi,

Thanks for your comments on the JIRA.

Should I be expecting exact results if the training data and the sample data
are exactly the same or is there just too little training data to tell at
this stage?

I think having a model trained from html would be very useful.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 19 January 2011 20:42, Paul Cowan <da...@scotalt.net> wrote:

> I have created a JIRA issue which contains a sample html and a failing
> test.
>
> https://issues.apache.org/jira/browse/OPENNLP-67
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 13 January 2011 10:21, Paul Cowan <da...@scotalt.net> wrote:
>
>> >> Open a new jira issue, either just attach a piece of test data or
>> contribute a patch which also contains the additions to the unit tests.
>>
>> I will do that.
>>
>>
>> Cheers
>>
>> Paul Cowan
>>
>> Cutting-Edge Solutions (Scotland)
>>
>> http://thesoftwaresimpleton.blogspot.com/
>>
>>
>>
>> On 13 January 2011 10:15, Jörn Kottmann <ko...@gmail.com> wrote:
>>
>>> On 1/13/11 10:55 AM, Paul Cowan wrote:
>>>
>>>> Maybe you can contribute
>>>>>>
>>>>>  a small sample of your training data to the project so we can
>>>> add a junit test.
>>>>
>>>> I will gladly do that.  What is the best way to do that?  I believe the
>>>> source control is moving.
>>>>
>>>> Is git an option or mercurial?  Pull requests are great for this type of
>>>> thing through github or the mercurial equivalent.  I will make the model
>>>> available for HTML parsing when it is finished also.
>>>>
>>>
>>> Even when you do not have issues it would be nice to have a small html
>>> test.
>>>
>>> The code is already moved to the Apache repository; even our website has
>>> checkout instructions:
>>> http://incubator.apache.org/opennlp/source-code.html
>>>
>>> Open a new jira issue, either just attach a piece of test data or
>>> contribute
>>> a patch which also contains the additions to the unit tests.
>>>
>>> Thanks,
>>> Jörn
>>>
>>>
>>>
>>
>

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
I have created a JIRA issue which contains a sample html and a failing test.


https://issues.apache.org/jira/browse/OPENNLP-67

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 13 January 2011 10:21, Paul Cowan <da...@scotalt.net> wrote:

> >> Open a new jira issue, either just attach a piece of test data or
> contribute a patch which also contains the additions to the unit tests.
>
> I will do that.
>
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 13 January 2011 10:15, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 1/13/11 10:55 AM, Paul Cowan wrote:
>>
>>> Maybe you can contribute
>>>>>
>>>>  a small sample of your training data to the project so we can
>>> add a junit test.
>>>
>>> I will gladly do that.  What is the best way to do that?  I believe the
>>> source control is moving.
>>>
>>> Is git an option or mercurial?  Pull requests are great for this type of
>>> thing through github or the mercurial equivalent.  I will make the model
>>> available for HTML parsing when it is finished also.
>>>
>>
>> Even when you do not have issues it would be nice to have a small html
>> test.
>>
>> The code is already moved to the Apache repository; even our website has
>> checkout instructions:
>> http://incubator.apache.org/opennlp/source-code.html
>>
>> Open a new jira issue, either just attach a piece of test data or
>> contribute
>> a patch which also contains the additions to the unit tests.
>>
>> Thanks,
>> Jörn
>>
>>
>>
>

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
>> Open a new jira issue, either just attach a piece of test data or
contribute a patch which also contains the additions to the unit tests.

I will do that.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 13 January 2011 10:15, Jörn Kottmann <ko...@gmail.com> wrote:

> On 1/13/11 10:55 AM, Paul Cowan wrote:
>
>> Maybe you can contribute
>>>>
>>>  a small sample of your training data to the project so we can
>> add a junit test.
>>
>> I will gladly do that.  What is the best way to do that?  I believe the
>> source control is moving.
>>
>> Is git an option or mercurial?  Pull requests are great for this type of
>> thing through github or the mercurial equivalent.  I will make the model
>> available for HTML parsing when it is finished also.
>>
>
> Even when you do not have issues it would be nice to have a small html
> test.
>
> The code is already moved to the Apache repository; even our website has
> checkout instructions:
> http://incubator.apache.org/opennlp/source-code.html
>
> Open a new jira issue, either just attach a piece of test data or
> contribute
> a patch which also contains the additions to the unit tests.
>
> Thanks,
> Jörn
>
>
>

Re: Training multiple models

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/13/11 10:55 AM, Paul Cowan wrote:
>>> Maybe you can contribute
>   a small sample of your training data to the project so we can
> add a junit test.
>
> I will gladly do that.  What is the best way to do that?  I believe the
> source control is moving.
>
> Is git an option or mercurial?  Pull requests are great for this type of
> thing through github or the mercurial equivalent.  I will make the model
> available for HTML parsing when it is finished also.

Even when you do not have issues it would be nice to have a small html test.

The code is already moved to the Apache repository; even our website has
checkout instructions:
http://incubator.apache.org/opennlp/source-code.html

Open a new jira issue, either just attach a piece of test data or 
contribute
a patch which also contains the additions to the unit tests.

Thanks,
Jörn



Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
>>  do you have an issue with the parser and your html training data ?

Sorry, I misread this the first time; I do not have any issue with the parser.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 13 January 2011 09:55, Paul Cowan <da...@scotalt.net> wrote:

> >> Maybe you can contribute
>  a small sample of your training data to the project so we can
> add a junit test.
>
> I will gladly do that.  What is the best way to do that?  I believe the
> source control is moving.
>
> Is git an option or mercurial?  Pull requests are great for this type of
> thing through github or the mercurial equivalent.  I will make the model
> available for HTML parsing when it is finished also.
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 13 January 2011 09:32, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> On 1/7/11 7:51 AM, Paul Cowan wrote:
>>
>>> I've had a dig at the source to try and answer my own questions
>>>
>>> Could somebody confirm that this is the correct way to delimit multiple
>>> documents for training?
>>>
>>> An example of which would be:
>>>
>>> <html><body><p>example</p><h2>  <START:organization>Organization
>>> One<END>
>>> </h2>  ........</body></html>
>>>
>>> <html><body><p>example</p><h2>  <START:organization>Organization
>>> Two<END>
>>> </h2>  ........</body></html>
>>>
>>> That is, I have a blank line between each document which will ensure that
>>> clearAdaptiveData is called?
>>>
>>>  Yes, a blank line indicates a new document in the training data, which
>> clears
>> all adaptive data on the feature generators. The only adaptive feature
>> generator
>> is currently the previous map feature generator.
>>
>>  My code to train the model is:
>>>
>>>  @Test
>>>     public void testHtmlOrganizationFind() throws Exception{
>>>         InputStream in = getClass().getClassLoader().getResourceAsStream(
>>>         "opennlp/tools/namefind/htmlbasic.train");
>>>
>>>         ObjectStream<NameSample>  sampleStream = new
>>> NameSampleDataStream(
>>>                 new PlainTextByLineStream(new InputStreamReader(in))
>>>         );
>>>
>>>         TokenNameFinderModel nameFinderModel;
>>>
>>>         nameFinderModel = NameFinderME.train("en", "organization",
>>>                 sampleStream, Collections.<String, Object>emptyMap());
>>>
>>>         try{
>>>             sampleStream.close();
>>>         }
>>>         catch (IOException ioe){
>>>         }
>>>
>>>         File modelOutFile = new File(
>>>             "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>>>
>>>         if(modelOutFile.exists()){
>>>             try{
>>>                 modelOutFile.delete();
>>>             }
>>>             catch (Exception ex){
>>>             }
>>>         }
>>>
>>>         OutputStream modelOut = null;
>>>
>>>         try{
>>>             modelOut = new BufferedOutputStream(new
>>> FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>>>             nameFinderModel.serialize(modelOut);
>>>         }catch (IOException ioe){
>>>             System.err.println("failed");
>>>             System.err.println("Error during writing model file: " +
>>> ioe.getMessage());
>>>         }finally {
>>>             if(modelOut != null){
>>>                 try{
>>>                     modelOut.close();
>>>                 }catch(IOException ioe){
>>>                   System.err.println("Failed to properly close model file: " +
>>>                       ioe.getMessage());
>>>                 }
>>>             }
>>>         }
>>>
>>>         assert(modelOutFile.exists());
>>>     }
>>>
>>>
>>>
>> Your training code looks good to me. Do you have an issue with
>> the parser and your html training data? Maybe you can contribute
>>  a small sample of your training data to the project so we can
>> add a junit test.
>>
>> Jörn
>>
>
>

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
>> Maybe you can contribute
 a small sample of your training data to the project so we can
add a junit test.

I will gladly do that.  What is the best way to do that?  I believe the
source control is moving.

Is git an option or mercurial?  Pull requests are great for this type of
thing through github or the mercurial equivalent.  I will make the model
available for HTML parsing when it is finished also.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 13 January 2011 09:32, Jörn Kottmann <ko...@gmail.com> wrote:

> On 1/7/11 7:51 AM, Paul Cowan wrote:
>
>> I've had a dig at the source to try and answer my own questions
>>
>> Could somebody confirm that this is the correct way to delimit multiple
>> documents for training?
>>
>> An example of which would be:
>>
>> <html><body><p>example</p><h2>  <START:organization>Organization One<END>
>> </h2>  ........</body></html>
>>
>> <html><body><p>example</p><h2>  <START:organization>Organization Two<END>
>> </h2>  ........</body></html>
>>
>> That is, I have a blank line between each document which will ensure that
>> clearAdaptiveData is called?
>>
>>  Yes, a blank line indicates a new document in the training data, which
> clears
> all adaptive data on the feature generators. The only adaptive feature
> generator
> is currently the previous map feature generator.
>
>  My code to train the model is:
>>
>>  @Test
>>     public void testHtmlOrganizationFind() throws Exception{
>>         InputStream in = getClass().getClassLoader().getResourceAsStream(
>>         "opennlp/tools/namefind/htmlbasic.train");
>>
>>         ObjectStream<NameSample>  sampleStream = new NameSampleDataStream(
>>                 new PlainTextByLineStream(new InputStreamReader(in))
>>         );
>>
>>         TokenNameFinderModel nameFinderModel;
>>
>>         nameFinderModel = NameFinderME.train("en", "organization",
>>                 sampleStream, Collections.<String, Object>emptyMap());
>>
>>         try{
>>             sampleStream.close();
>>         }
>>         catch (IOException ioe){
>>         }
>>
>>         File modelOutFile = new File(
>>             "/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>>
>>         if(modelOutFile.exists()){
>>             try{
>>                 modelOutFile.delete();
>>             }
>>             catch (Exception ex){
>>             }
>>         }
>>
>>         OutputStream modelOut = null;
>>
>>         try{
>>             modelOut = new BufferedOutputStream(new
>> FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>>             nameFinderModel.serialize(modelOut);
>>         }catch (IOException ioe){
>>             System.err.println("failed");
>>             System.err.println("Error during writing model file: " +
>> ioe.getMessage());
>>         }finally {
>>             if(modelOut != null){
>>                 try{
>>                     modelOut.close();
>>                 }catch(IOException ioe){
>>                   System.err.println("Failed to properly close model file: " +
>>                       ioe.getMessage());
>>                 }
>>             }
>>         }
>>
>>         assert(modelOutFile.exists());
>>     }
>>
>>
>>
> Your training code looks good to me. Do you have an issue with
> the parser and your html training data? Maybe you can contribute
>  a small sample of your training data to the project so we can
> add a junit test.
>
> Jörn
>

Re: Training multiple models

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/7/11 7:51 AM, Paul Cowan wrote:
> I've had a dig at the source to try and answer my own questions
>
> Could somebody confirm that this is the correct way to delimit multiple
> documents for training?
>
> An example of which would be:
>
> <html><body><p>example</p><h2>  <START:organization>Organization One<END>
> </h2>  ........</body></html>
>
> <html><body><p>example</p><h2>  <START:organization>Organization Two<END>
> </h2>  ........</body></html>
>
> That is, I have a blank line between each document which will ensure that
> clearAdaptiveData is called?
>
Yes, a blank line indicates a new document in the training data, which 
clears
all adaptive data on the feature generators. The only adaptive feature 
generator
is currently the previous map feature generator.
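The blank-line convention above can be sanity-checked with a few lines of plain Java (hypothetical helper, not the OpenNLP reader):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class DocumentBoundaryReader {
    // Count the documents in blank-line-delimited training data: each
    // blank line ends the current document (the point at which OpenNLP
    // clears the adaptive feature-generator data).
    public static int countDocuments(BufferedReader reader) throws IOException {
        int docs = 0;
        boolean inDoc = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (inDoc) docs++;
                inDoc = false;
            } else {
                inDoc = true;
            }
        }
        return inDoc ? docs + 1 : docs;
    }

    public static void main(String[] args) throws IOException {
        String data = "<html>doc one</html>\n\n<html>doc two</html>\n";
        System.out.println(countDocuments(new BufferedReader(new StringReader(data))));
    }
}
```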
> My code to train the model is:
>
>   @Test
>      public void testHtmlOrganizationFind() throws Exception{
>          InputStream in = getClass().getClassLoader().getResourceAsStream(
>          "opennlp/tools/namefind/htmlbasic.train");
>
>          ObjectStream<NameSample>  sampleStream = new NameSampleDataStream(
>                  new PlainTextByLineStream(new InputStreamReader(in))
>          );
>
>          TokenNameFinderModel nameFinderModel;
>
>          nameFinderModel = NameFinderME.train("en", "organization",
>                  sampleStream, Collections.<String, Object>emptyMap());
>
>          try{
>              sampleStream.close();
>          }
>          catch (IOException ioe){
>          }
>
>          File modelOutFile = new
> File("/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");
>
>          if(modelOutFile.exists()){
>              try{
>                  modelOutFile.delete();
>              }
>              catch (Exception ex){
>              }
>          }
>
>          OutputStream modelOut = null;
>
>          try{
>              modelOut = new BufferedOutputStream(new
> FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
>              nameFinderModel.serialize(modelOut);
>          }catch (IOException ioe){
>              System.err.println("failed");
>              System.err.println("Error during writing model file: " +
> ioe.getMessage());
>          }finally {
>              if(modelOut != null){
>                  try{
>                      modelOut.close();
>                  }catch(IOException ioe){
>                    System.err.println("Failed to properly close model file: " +
>                        ioe.getMessage());
>                  }
>              }
>          }
>
>          assertTrue(modelOutFile.exists());
>      }
>
>

Your training code looks good to me. Do you have an issue with
the parser and your HTML training data? Maybe you can contribute
a small sample of your training data to the project so we can
add a JUnit test.

Jörn

Re: Training multiple models

Posted by Paul Cowan <da...@scotalt.net>.
I've had a dig at the source to try and answer my own questions

Could somebody confirm that this is the correct way to delimit multiple
documents for training?

An example of which would be:

<html><body><p>example</p><h2> <START:organization>Organization One <END>
</h2> ........ </body></html>

<html><body><p>example</p><h2> <START:organization>Organization Two <END>
</h2> ........ </body></html>

That is, I have a blank line between each document which will ensure that
clearAdaptiveData is called?

My code to train the model is:

 @Test
    public void testHtmlOrganizationFind() throws Exception{
        InputStream in = getClass().getClassLoader().getResourceAsStream(
        "opennlp/tools/namefind/htmlbasic.train");

        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(in))
        );

        TokenNameFinderModel nameFinderModel;

        nameFinderModel = NameFinderME.train("en", "organization",
                sampleStream, Collections.<String, Object>emptyMap());

        try{
            sampleStream.close();
        }
        catch (IOException ioe){
            // nothing useful to do if closing the sample stream fails
        }

        File modelOutFile = new
File("/Users/paulcowan/projects/leadcapturer/models/en-ner-organization.bin");

        // File.delete() does not throw on failure; it reports it via its return value
        if(modelOutFile.exists() && !modelOutFile.delete()){
            System.err.println("Could not delete existing model file");
        }

        OutputStream modelOut = null;

        try{
            modelOut = new BufferedOutputStream(new
FileOutputStream(modelOutFile), IO_BUFFER_SIZE);
            nameFinderModel.serialize(modelOut);
        }catch (IOException ioe){
            System.err.println("Error during writing model file: " +
                ioe.getMessage());
        }finally {
            if(modelOut != null){
                try{
                    modelOut.close();
                }catch(IOException ioe){
                  System.err.println("Failed to properly close model file: " +
                      ioe.getMessage());
                }
            }
        }

        assertTrue(modelOutFile.exists());
    }
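As a side note, the delete/serialize/close dance above can be collapsed with
try-with-resources (Java 7+). This is a generic sketch, with the hypothetical
`StreamSerializable` interface standing in for `TokenNameFinderModel`:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ModelWriter {

    // Stand-in for TokenNameFinderModel.serialize(OutputStream).
    interface StreamSerializable {
        void serialize(OutputStream out) throws IOException;
    }

    // Deletes any stale model file, then writes the new one; the stream is
    // closed automatically even if serialize(...) throws.
    static void writeModel(StreamSerializable model, Path outFile) throws IOException {
        Files.deleteIfExists(outFile);
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(outFile))) {
            model.serialize(out);
        }
    }
}
```

Because `StreamSerializable` has a single abstract method, a lambda such as
`out -> nameFinderModel.serialize(out)` can be passed directly.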


Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 5 January 2011 20:32, Paul Cowan <da...@scotalt.net> wrote:

> Hi,
>
> I have created a sample file which for now is marked up only with
> <START:Organization>..<END> markers, and I have the following test which is
> passing.  Java is not a language I have spent an awful lot of time on so
> forgive any ignorance on my part:
>
>     @Test
>     public void testHtmlOrganizationFind() throws Exception{
>         InputStream in = getClass().getClassLoader().getResourceAsStream(
>         "opennlp/tools/namefind/htmlbasic.train");
>
>         ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>                 new PlainTextByLineStream(new InputStreamReader(in))
>         );
>
>         TokenNameFinderModel nameFinderModel = NameFinderME.train("en",
> "organization",
>                 sampleStream, Collections.<String, Object>emptyMap(), 70,
> 1);
>
>         assertNotNull(nameFinderModel);
>     }
>
> At the moment, I am preprocessing the htmlbasic.train file by stripping out
> all the new line characters so that it is just one line.
>
> I would be grateful if anyone could help me with the following questions:
>
> 1.  Is the "type" argument passed into the NameFinderME.train method the
> type of the model, which in my case is organization (<START:organization>)?
> If so, would I need to call train for each tag I mark up the text with? I
> want to use <START:location> and others, for example.
>
> 2.  How do I feed multiple files into the training?  Somebody said I could
> use the <HTML> tags as document delimiters. Or would another way be to
> merge all the documents into one file, delimited by the newline character?
> I cannot find a test which shows how to do this.
>
> Thanks
>
> Paul
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 23 December 2010 16:42, Benson Margulies <bi...@gmail.com> wrote:
>
>> If I were you, I'd keep HTML digestion separate from sentence bounding.
>>
>>
>> On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <da...@scotalt.net> wrote:
>> > Hi,
>> >
>> > Am I right in saying that I will also need to create and train my own
>> > HTML sentence detector in order to parse the HTML into chunks that can
>> > be tokenised?
>> >
>> > Cheers
>> >
>> > Paul Cowan
>> >
>> > Cutting-Edge Solutions (Scotland)
>> >
>> > http://thesoftwaresimpleton.blogspot.com/
>> >
>> >
>> >
>> > On 17 December 2010 15:10, Jörn Kottmann <ko...@gmail.com> wrote:
>> >
>> >> On 12/17/10 2:19 PM, James Kosin wrote:
>> >>
>> >>> I have the following questions that I would appreciate an answer for:
>> >>> >
>> >>> >  1. Can I have the different name finding tags in the same data?
>> >>>
>> >>
>> >> Yes, but that means you train one model which can detect each of these
>> >> name types. You should test both approaches: multiple name types in one
>> >> model, and a separate model for each name type. You can use the
>> >> built-in evaluation to validate your results.
>> >>
>> >>> >  2. Does the <START:address> <END> markup make sense over multiple
>> >>> >  lines or should I break this up further?
>> >>>
>> >> No, that is not possible; names spanning multiple sentences (a line is
>> >> a sentence) are not supported.
>> >>
>> >>
>> >>> >  3. I want to use 200 or 300 different examples, do I need to create
>> >>> >  separate files for each example or can I merge them all into one?
>> >>> >  And if it is only one file, do I need to mark up the start and end
>> >>> >  of a file?
>> >>>
>> >> If you want to use the command-line training tool, they must all be in
>> >> one file; if you use the API, it's up to you to merge these different
>> >> sources into one name sample stream.
>> >>
>> >> Jörn
>> >>
>> >
>>
>
>
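Jörn's closing point in the quoted thread, that with the API it is up to you
to merge multiple sources into one name sample stream, could look roughly
like this. `TrainingDataMerger` is a hypothetical helper, not part of OpenNLP;
it joins files with a blank line so each file becomes its own document:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TrainingDataMerger {

    // Joins several training documents into one stream, inserting a blank
    // line between them so each is treated as a separate document (and the
    // name finder's adaptive data is cleared at each boundary).
    static String mergeDocuments(List<String> fileContents) {
        return String.join("\n\n", fileContents);
    }

    // Reads each file and merges them; the resulting Reader could then be
    // wrapped in OpenNLP's PlainTextByLineStream and NameSampleDataStream
    // just like the single-file InputStreamReader in the tests above.
    static Reader mergeFiles(List<Path> files) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (Path p : files) {
            if (sb.length() > 0) {
                sb.append("\n\n"); // blank line = document boundary
            }
            sb.append(Files.readString(p, StandardCharsets.UTF_8).trim());
        }
        return new StringReader(sb.toString());
    }
}
```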