Posted to users@opennlp.apache.org by William Colen <co...@apache.org> on 2011/01/03 19:52:15 UTC

Re: Training Chunker

Hi, Daniel,

Sorry for the late reply.
I'll work on a draft for that page, and will check how to train a Portuguese
chunker using Bosque. It will take some days to finish, but I'll come back
as soon as I have something.

Regards
William



On Tue, Dec 28, 2010 at 4:39 AM, daniel gatis <da...@gmail.com> wrote:

> Hi everyone,
> I want to train a Portuguese chunker with the Bosque corpus, but the wiki
> topic about this is blank.
> So how can I train a Portuguese chunker?
>

Re: Training Chunker

Posted by "Eraldo R. Fernandes" <er...@gmail.com>.
On Fri, Jan 7, 2011 at 7:30 PM, William Colen <co...@apache.org> wrote:
>
> Maybe the poor performance we got before was related to Amazonia.ad, which
> is an unrevised automatically generated corpus. The problem with
> Bosque_CF_8.0 is that it is too small (< 10k sentences).
>

I also think so. Another point is that Amazonia's texts are from a
very different domain (in fact, many different domains). You could try
the Selva corpus (or a part of it), which is "shallow"-revised data.

-- 
Eraldo R. Fernandes
http://eraldoluis.pro.br

Re: Training Chunker

Posted by William Colen <co...@apache.org>.
Good news!

After some modifications and using another corpus, we got much nicer
results:

Precision: 0.9413606010016694
Recall: 0.9379938451301671
F-Measure: 0.9396742073907428

For these results I used the corpus Bosque_CF_8.0.ad
<http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz>
to perform a 10-fold cross-validation.
Maybe the poor performance we got before was related to Amazonia.ad, which
is an unrevised automatically generated corpus. The problem with
Bosque_CF_8.0 is that it is too small (< 10k sentences).
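As a quick sanity check, the F-measure reported above is the harmonic mean of precision and recall, which is easy to verify:

```python
# Verify that the reported F-measure is the harmonic mean of the
# precision and recall from the 10-fold cross-validation above.
precision = 0.9413606010016694
recall = 0.9379938451301671

f_measure = 2 * precision * recall / (precision + recall)
print(f_measure)  # ~0.9396742, matching the reported F-Measure
```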

Regards
William

On Fri, Jan 7, 2011 at 1:25 AM, William Colen <co...@apache.org> wrote:

> Hi Daniel,
>
> I have some news. I wrote a tool to extract the chunk information from
> the Bosque AD format and create OpenNLP training data. It is not working
> well yet, but it is a starting point.
> I'm getting the following results:
>
> Precision: 0.7680814205283673
> Recall: 0.8237343241987923
> F-Measure: 0.7949350067234425
>
> These results are poor compared with the ones we get using the English
> data. The problem is probably the heuristic used to extract chunk information.
>
> What I did:
> 1. As described in that PUC-Rio paper: "Defined as chunk all consecutive
> tokens within the same deepest-level phrase."
> 2. I'm considering the group forms described at section 2.1 of Floresta
> Symbolset <http://beta.visl.sdu.dk/visl/pt/info/symbolset-floresta.html>
>
> Here is a sample:
>
> AD format:
> STA:cu
> =CJT:fcl
> ==ADVL:adv("depois" <left>)    depois
> ==ACC-PASS:pron-pers("se" <coll> <left> M 3P ACC)    se
> ==P:v-fin("encontrar" <se-passive> <nosubj> <cjt-head> <fmc> <mv> PR 3P IND
> VFIN)    encontram
> ==PIV:pp
> ===H:prp("com" <right>)    com
> ===P<:np
> ====>N:art("o" <artd> DET F S)    a
> ====H:n("dissidência" <np-def> <ac> <am> F S)    dissidência
> ====N<:pp
> =====H:prp("de" <sam-> <np-close>)    de
> =====P<:np
> ======>N:art("o" <artd> <-sam> DET M S)    o
> ======H:n("grupo" <np-def> <HH> M S)    grupo
> ======,
> ======APP:np
> =======>N:art("o" <artd> DET M P)    os
> =======H:prop("Bacamarteiros_de_Pinga_Fogo" <org> <np-close> M P)
> Bacamarteiros_de_Pinga_Fogo
> =,
> =CO:conj-c("e" <co-fin> <co-fmc>)    e
> =CJT:x
> ==SUBJ:np
> ===>N:art("o" <artd> DET F S)    a
> ===H:n("festa" <np-def> <occ> <left> F S)    festa
> ==P:v-fin("continuar" <cjt-sta> <fmc> <mv> PR 3S IND VFIN)    continua
> ==ADVL:pp
> ===H:prp("por" <right>)    por
> ===P<:n("muito_tempo" <np-idf> <dur> M S)    muito_tempo
> .
>
> Result:
>
> depois adv O
> se pron-pers O
> encontram v-fin B-VP
> com prp B-PP
> a art B-NP
> dissidência n I-NP
> de prp B-PP
> o art B-NP
> grupo n I-NP
> , , I-NP
> os art B-NP
> Bacamarteiros_de_Pinga_Fogo prop I-NP
> , , O
> e conj-c O
> a art B-NP
> festa n I-NP
> continua v-fin B-VP
> por prp B-PP
> muito_tempo n I-PP
> . . O
>
> The code to perform the conversion is
> opennlp.tools.formats.ADChunkSampleStream (only at SVN Trunk)
>
> Follow the instructions if you want to reproduce the experiment and check
> the results.
>
> A. Prepare the environment
>
> - Get the code from SVN trunk, as described here:
> http://incubator.apache.org/opennlp/source-code.html
> - You will need Maven 3.0.1 to compile the project. If you don't have it
> yet, get it from http://maven.apache.org/download.html; the installation
> instructions are on the same page.
> - Compile the project. To do that, go to the folder <project-root>/opennlp/
> from the command line and run "mvn install". The first build can take a
> while.
> - Now go to the folder <project-root>/opennlp-tools
> - Execute the command:
>    mvn dependency:copy-dependencies -DoutputDirectory="lib"
> to copy the libraries to the lib folder
> - Copy the file
> <project-root>/opennlp-tools/target/opennlp-tools-1.5.1-incubating-SNAPSHOT.jar
> to <project-root>/opennlp-tools
> - Now we are ready to execute Apache OpenNLP!
>
> B. Use the ChunkConverter
>
> - Download the Amazonia Corpus from Bosque and extract somewhere:
> http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
> - Now we have to split the corpus to a size we can handle. I counted
> almost 271,000 sentences in the corpus. Since my computer can't handle
> that many sentences, I'll extract the first 2,000 for evaluation and the
> next 20,000 for training.
> - Create the evaluation data
>    bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data
> ../../../corpus/amazonia.ad -start 0 -end 2000 > amazonia-chunk.eval
> You will see some "Couldn't parse leaf" messages; a few leaves did not
> follow the expected format. We will have to check how to handle them later.
> - Create the train data
>    bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data
> ../../../corpus/amazonia.ad -start 2001 -end 22000 > amazonia-chunk.train
> You can check the results and verify that they are consistent.
>
> C. Train
> - Execute the command:
>    bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang pt -data
> amazonia-chunk.train -model pt-chunker.bin
>
> D. Evaluation
> - Execute the command
>    bin/opennlp ChunkerEvaluator -data amazonia-chunk.eval -model
> pt-chunker.bin
>
> Regards,
> William
>

Re: Training Chunker

Posted by William Colen <co...@apache.org>.
Hi Daniel,

I have some news. I wrote a tool to extract the chunk information from
the Bosque AD format and create OpenNLP training data. It is not working
well yet, but it is a starting point.
I'm getting the following results:

Precision: 0.7680814205283673
Recall: 0.8237343241987923
F-Measure: 0.7949350067234425

These results are poor compared with the ones we get using the English
data. The problem is probably the heuristic used to extract chunk information.

What I did:
1. As described in that PUC-Rio paper: "Defined as chunk all consecutive
tokens within the same deepest-level phrase."
2. I'm considering the group forms described at section 2.1 of Floresta
Symbolset <http://beta.visl.sdu.dk/visl/pt/info/symbolset-floresta.html>

Here is a sample:

AD format:
STA:cu
=CJT:fcl
==ADVL:adv("depois" <left>)    depois
==ACC-PASS:pron-pers("se" <coll> <left> M 3P ACC)    se
==P:v-fin("encontrar" <se-passive> <nosubj> <cjt-head> <fmc> <mv> PR 3P IND
VFIN)    encontram
==PIV:pp
===H:prp("com" <right>)    com
===P<:np
====>N:art("o" <artd> DET F S)    a
====H:n("dissidência" <np-def> <ac> <am> F S)    dissidência
====N<:pp
=====H:prp("de" <sam-> <np-close>)    de
=====P<:np
======>N:art("o" <artd> <-sam> DET M S)    o
======H:n("grupo" <np-def> <HH> M S)    grupo
======,
======APP:np
=======>N:art("o" <artd> DET M P)    os
=======H:prop("Bacamarteiros_de_Pinga_Fogo" <org> <np-close> M P)
Bacamarteiros_de_Pinga_Fogo
=,
=CO:conj-c("e" <co-fin> <co-fmc>)    e
=CJT:x
==SUBJ:np
===>N:art("o" <artd> DET F S)    a
===H:n("festa" <np-def> <occ> <left> F S)    festa
==P:v-fin("continuar" <cjt-sta> <fmc> <mv> PR 3S IND VFIN)    continua
==ADVL:pp
===H:prp("por" <right>)    por
===P<:n("muito_tempo" <np-idf> <dur> M S)    muito_tempo
.

Result:

depois adv O
se pron-pers O
encontram v-fin B-VP
com prp B-PP
a art B-NP
dissidência n I-NP
de prp B-PP
o art B-NP
grupo n I-NP
, , I-NP
os art B-NP
Bacamarteiros_de_Pinga_Fogo prop I-NP
, , O
e conj-c O
a art B-NP
festa n I-NP
continua v-fin B-VP
por prp B-PP
muito_tempo n I-PP
. . O

The code to perform the conversion is
opennlp.tools.formats.ADChunkSampleStream (only at SVN Trunk)
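For illustration, the core of the heuristic (every maximal run of tokens sharing the same deepest-level phrase becomes one chunk) can be sketched as follows. This is a hypothetical simplification, not the actual ADChunkSampleStream code; the token representation and the phrase-type mapping are assumptions for the sketch:

```python
# Sketch of the deepest-level-phrase chunking heuristic (hypothetical,
# simplified). Each token is (word, pos, phrase_id, phrase_type), where
# phrase_id identifies the deepest phrase node containing the token.
PHRASE_TO_CHUNK = {"np": "NP", "pp": "PP", "vp": "VP"}  # assumed mapping

def bio_tags(tokens):
    """Emit one B-I-O chunk tag per token: B-X starts a new chunk,
    I-X continues the current one, O marks tokens outside any chunk."""
    tags, prev_id = [], None
    for word, pos, phrase_id, phrase_type in tokens:
        chunk = PHRASE_TO_CHUNK.get(phrase_type)
        if chunk is None:
            tags.append("O")
            phrase_id = None           # an O token breaks the current chunk
        elif phrase_id == prev_id:
            tags.append("I-" + chunk)  # same deepest phrase: continue chunk
        else:
            tags.append("B-" + chunk)  # new deepest phrase: start a chunk
        prev_id = phrase_id
    return tags

# A fragment of the sample sentence above:
sent = [("a", "art", 1, "np"), ("dissidência", "n", 1, "np"),
        ("de", "prp", 2, "pp"), ("o", "art", 3, "np"),
        ("grupo", "n", 3, "np")]
print(bio_tags(sent))  # ['B-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP']
```

Note how "de" becomes a one-token B-PP chunk while the following "o grupo" starts a fresh B-NP, mirroring the sample output above.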

Follow the instructions if you want to reproduce the experiment and check
the results.

A. Prepare the environment

- Get the code from SVN trunk, as described here:
http://incubator.apache.org/opennlp/source-code.html
- You will need Maven 3.0.1 to compile the project. If you don't have it
yet, get it from http://maven.apache.org/download.html; the installation
instructions are on the same page.
- Compile the project. To do that, go to the folder <project-root>/opennlp/
from the command line and run "mvn install". The first build can take a
while.
- Now go to the folder <project-root>/opennlp-tools
- Execute the command:
   mvn dependency:copy-dependencies -DoutputDirectory="lib"
to copy the libraries to the lib folder
- Copy the file
<project-root>/opennlp-tools/target/opennlp-tools-1.5.1-incubating-SNAPSHOT.jar
to <project-root>/opennlp-tools
- Now we are ready to execute Apache OpenNLP!

B. Use the ChunkConverter

- Download the Amazonia Corpus from Bosque and extract somewhere:
http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
- Now we have to split the corpus to a size we can handle. I counted
almost 271,000 sentences in the corpus. Since my computer can't handle
that many sentences, I'll extract the first 2,000 for evaluation and the
next 20,000 for training.
- Create the evaluation data
   bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data
../../../corpus/amazonia.ad -start 0 -end 2000 > amazonia-chunk.eval
You will see some "Couldn't parse leaf" messages; a few leaves did not
follow the expected format. We will have to check how to handle them later.
- Create the train data
   bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data
../../../corpus/amazonia.ad -start 2001 -end 22000 > amazonia-chunk.train
You can check the results and verify that they are consistent.
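The split in step B simply takes two disjoint index ranges from the corpus. A minimal sketch of the same idea (hypothetical; it assumes the sentences are already loaded into a list, whereas the real converter does this through the -start/-end options shown above):

```python
def split_corpus(sentences, eval_size=2000, train_size=20000):
    """Take the first eval_size sentences for evaluation and the
    next train_size sentences for training, as in step B above."""
    eval_part = sentences[:eval_size]
    train_part = sentences[eval_size:eval_size + train_size]
    return eval_part, train_part

sentences = ["sentence %d" % i for i in range(25000)]
eval_part, train_part = split_corpus(sentences)
print(len(eval_part), len(train_part))  # 2000 20000
```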

C. Train
- Execute the command:
   bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang pt -data
amazonia-chunk.train -model pt-chunker.bin

D. Evaluation
- Execute the command
   bin/opennlp ChunkerEvaluator -data amazonia-chunk.eval -model
pt-chunker.bin

Regards,
William

Re: Training Chunker

Posted by William Colen <co...@apache.org>.
Hi Daniel,

On Wed, Jan 5, 2011 at 7:29 PM, daniel gatis <da...@gmail.com> wrote:

> Hey Colen,
> I will read this paper too, and this wiki<http://www.learn.inf.puc-rio.br/index.php/Welcome_to_LEARN_homepage%21> may
> help. I can try to contact someone at http://www.inf.puc-rio.br/ to
> help.
>

It would be great if they could make available the training Corpus described
in chapter 2.

Thanks,
William

Re: Training Chunker

Posted by daniel gatis <da...@gmail.com>.
Hey Colen,
I will read this paper too, and this wiki
<http://www.learn.inf.puc-rio.br/index.php/Welcome_to_LEARN_homepage!>
may help. I can try to contact someone at http://www.inf.puc-rio.br/ to
help.

thanks

On Wed, Jan 5, 2011 at 6:02 PM, William Colen <co...@apache.org> wrote:

> Hi Daniel,
>
> Now we have a Chunk page at Wiki:
> https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Chunker
>
> Now we need to check how to extract chunk information from Bosque Corpus.
>
> We have a small sample of Amazonia.AD here:
>
> http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/formats/ad.sample?view=markup
> .
> The AD format has phrases, but not chunks. I'm thinking about using the
> heuristic proposed by A Machine Learning Approach to Portuguese Clause
> Identification <http://webscience.org.br/wiki/images/f/f9/Clause-propor2010.pdf>
> to do that.
>
> Thanks,
> William
>
> On Mon, Jan 3, 2011 at 10:18 PM, daniel gatis <da...@gmail.com>
> wrote:
>
> > Yeah! This year began very well ;)
> >
> > thank you.
> >
> > On Mon, Jan 3, 2011 at 3:52 PM, William Colen <co...@apache.org> wrote:
> >
> > > Hi, Daniel,
> > >
> > > Sorry for the late reply.
> > > I'll work on a draft for that page, and will check how to train a
> > > Portuguese
> > > chunker using Bosque. It will take some days to finish, but I'll come
> > back
> > > as soon as I have something.
> > >
> > > Regards
> > > William
> > >
> > >
> > >
> > > On Tue, Dec 28, 2010 at 4:39 AM, daniel gatis <da...@gmail.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > > I want to train a Portuguese chunker with the Bosque corpus, but the
> > > > wiki topic about this is blank.
> > > > So how can I train a Portuguese chunker?
> > > >
> > >
> >
>

Re: Training Chunker

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Hi Jörn,

Yes, I'll do it.

On Thu, Jan 6, 2011 at 8:59 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 1/5/11 10:02 PM, William Colen wrote:
>
>> Hi Daniel,
>>
>> Now we have a Chunk page at Wiki:
>> https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Chunker
>>
>>
> We started a DocBook documentation project. Would you mind putting
> this documentation inside the DocBook? I already migrated
> the documentation that is in the wiki.
>
> Jörn
>

Re: Training Chunker

Posted by Jörn Kottmann <ko...@gmail.com>.
On 1/5/11 10:02 PM, William Colen wrote:
> Hi Daniel,
>
> Now we have a Chunk page at Wiki:
> https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Chunker
>

We started a DocBook documentation project. Would you mind putting
this documentation inside the DocBook? I already migrated
the documentation that is in the wiki.

Jörn

Re: Training Chunker

Posted by William Colen <co...@apache.org>.
Hi Daniel,

Now we have a Chunk page at Wiki:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Chunker

Now we need to check how to extract chunk information from Bosque Corpus.

We have a small sample of Amazonia.AD here:
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/formats/ad.sample?view=markup.
The AD format has phrases, but not chunks. I'm thinking about using the
heuristic proposed by A Machine Learning Approach to Portuguese Clause
Identification <http://webscience.org.br/wiki/images/f/f9/Clause-propor2010.pdf>
to do that.

Thanks,
William

On Mon, Jan 3, 2011 at 10:18 PM, daniel gatis <da...@gmail.com> wrote:

> Yeah! This year began very well ;)
>
> thank you.
>
> On Mon, Jan 3, 2011 at 3:52 PM, William Colen <co...@apache.org> wrote:
>
> > Hi, Daniel,
> >
> > Sorry for the late reply.
> > I'll work on a draft for that page, and will check how to train a
> > Portuguese
> > chunker using Bosque. It will take some days to finish, but I'll come
> back
> > as soon as I have something.
> >
> > Regards
> > William
> >
> >
> >
> > On Tue, Dec 28, 2010 at 4:39 AM, daniel gatis <da...@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > > I want to train a Portuguese chunker with the Bosque corpus, but the
> > > wiki topic about this is blank.
> > > So how can I train a Portuguese chunker?
> > >
> >
>

Re: Training Chunker

Posted by daniel gatis <da...@gmail.com>.
Yeah! This year began very well ;)

thank you.

On Mon, Jan 3, 2011 at 3:52 PM, William Colen <co...@apache.org> wrote:

> Hi, Daniel,
>
> Sorry for the late reply.
> I'll work on a draft for that page, and will check how to train a
> Portuguese
> chunker using Bosque. It will take some days to finish, but I'll come back
> as soon as I have something.
>
> Regards
> William
>
>
>
> On Tue, Dec 28, 2010 at 4:39 AM, daniel gatis <da...@gmail.com>
> wrote:
>
> > Hi everyone,
> > I want to train a Portuguese chunker with the Bosque corpus, but the
> > wiki topic about this is blank.
> > So how can I train a Portuguese chunker?
> >
>