Posted to dev@opennlp.apache.org by Cristian Petroaca <cr...@gmail.com> on 2015/10/12 14:24:46 UTC

Re: Word Sense Disambiguator

Hi,

Thanks Anthony for the info.
Does anybody else know when the WSD component will be merged into trunk, and
whether a release could be cut with it?

Thanks

On Sat, Sep 19, 2015 at 9:21 AM, Anthony Beylerian <
anthony.beylerian@gmail.com> wrote:

> Hey Cristian,
>
> Sorry for the late reply. I am currently on summer break but will get back
> to it in one to two weeks.
>
> Can't really say when there will be a new release.
> This usually involves other components as well and it might take time to
> vote.
>
> However, some things to expect for the WSD component:
>
> - Support for the different types of classifiers for the supervised
> approaches (right now only ME based).
> - Support for augmenting the general domain training with specific domain
> information.
>
> Best,
>
> Anthony
>
>
> On Thu, Sep 17, 2015 at 11:18 PM, Cristian Petroaca <
> cristian.petroaca@gmail.com> wrote:
>
> > Hi Anthony,
> >
> > Do you know when the WSD component will be available in an OpenNLP
> > release?
> >
> > Thanks,
> > Cristian
> >
> > On Thu, Sep 10, 2015 at 10:32 AM, Cristian Petroaca <
> > cristian.petroaca@gmail.com> wrote:
> >
> > > Yes, that's what I was looking for.
> > > Thanks Aliaksandr.
> > >
> > > On Wed, Sep 9, 2015 at 9:39 PM, Aliaksandr Autayeu <aliaksandr@autayeu.com> wrote:
> > >
> > >> Cristian, the reference you gave basically uses synset offsets: 1740 is
> > >> entity, 1930 is physical entity, etc. However, in YAGO they seem to have
> > >> added 100000000 to those offsets.
> > >>
> > >> A synset offset is the fastest way into the WordNet dictionary, because it
> > >> is a direct file offset. The offset alone is not enough though; you also
> > >> need the POS (part of speech). Speed is probably the reason most people
> > >> access WordNet this way. However, the offset is not the best "key",
> > >> especially for indexing, because offsets change as WordNet evolves.
> > >> SenseKeys (e.g. bank%1:14:00:: and bank%1:21:01::) should be more suitable
> > >> for indexing.
> > >>
> > >> If you're looking to connect with YAGO above, you might do something along
> > >> the lines of
> > >> ....getWordBySenseKey(sensekey).getSynset().getOffset and then add
> > >> 100000000 to get the YAGO ids.
> > >>
> > >> Aliaksandr
> > >>
> > >>
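
The mapping described above can be sketched in plain Java without any WordNet library. This is only an illustration under stated assumptions: the "offset plus 100000000" rule is taken from the message above and is not verified against YAGO itself, the part-of-speech digits follow the sense key layout in the WordNet senseidx documentation, and the class and method names (SenseKeySketch, lemmaOf, posOf, toYagoId) are hypothetical.

```java
import java.util.Locale;

/** Sketch: split a WordNet sense key and derive a YAGO-style id from a synset offset. */
final class SenseKeySketch {

    // Assumption from the thread: YAGO ids look like the synset offset plus 100000000.
    static final long YAGO_BASE = 100_000_000L;

    // ss_type digit after '%': 1=noun, 2=verb, 3=adjective, 4=adverb, 5=adjective satellite.
    private static final String[] SS_TYPES =
            {"noun", "verb", "adjective", "adverb", "adjective satellite"};

    /** Lemma part of a sense key, e.g. "bank" for "bank%1:14:00::". */
    static String lemmaOf(String senseKey) {
        int percent = senseKey.lastIndexOf('%');
        if (percent <= 0) {
            throw new IllegalArgumentException("Not a sense key: " + senseKey);
        }
        return senseKey.substring(0, percent).toLowerCase(Locale.ROOT);
    }

    /** Part of speech encoded in the first field after '%'. */
    static String posOf(String senseKey) {
        int percent = senseKey.lastIndexOf('%');
        int colon = senseKey.indexOf(':', percent + 1);
        if (percent <= 0 || colon < 0) {
            throw new IllegalArgumentException("Not a sense key: " + senseKey);
        }
        int ssType = Integer.parseInt(senseKey.substring(percent + 1, colon));
        if (ssType < 1 || ssType > SS_TYPES.length) {
            throw new IllegalArgumentException("Unknown ss_type: " + ssType);
        }
        return SS_TYPES[ssType - 1];
    }

    /** Applies the offset rule quoted in the thread to a synset offset. */
    static long toYagoId(long synsetOffset) {
        if (synsetOffset < 0) {
            throw new IllegalArgumentException("Offset must be non-negative");
        }
        return YAGO_BASE + synsetOffset;
    }

    public static void main(String[] args) {
        String key = "bank%1:14:00::";
        System.out.println(lemmaOf(key) + " (" + posOf(key) + ")");
        System.out.println("entity -> " + toYagoId(1740));
    }
}
```

Because offsets change between WordNet versions, an index keyed on the sense key stays stable while the derived YAGO-style id is only valid for the WordNet version the offset came from.
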
> > >> On 9 September 2015 at 09:51, Cristian Petroaca <cristian.petroaca@gmail.com> wrote:
> > >>
> > >> > I am looking for the Sense Id of the word. It has this format here:
> > >> > http://resources.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoWordnetIds.txt
> > >> >
> > >> >
> > >> > On Tue, Sep 8, 2015 at 6:47 PM, Anthony Beylerian <
> > >> > anthony.beylerian@gmail.com> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > Thanks, it is still being improved.
> > >> > >
> > >> > > I am not sure what you mean by type or database ID.
> > >> > > Currently the sense source and the sense ID are returned.
> > >> > >
> > >> > > For example:
> > >> > >
> > >> > > "I went to the bank to deposit money."
> > >> > > target : bank (index : 4)
> > >> > > expected output : [WORDNET bank%1:14:00:: 21.6,
> > >> > > WORDNET bank%1:21:01:: 5.8,... etc]
> > >> > >
> > >> > > Where "bank%1:14:00::" is a SenseKey which you can use to query WordNet
> > >> > > for a sense definition.
> > >> > >
> > >> > > You can do this using the default dictionary:
> > >> > >
> > >> > > Dictionary.getDefaultResourceInstance().getWordBySenseKey(sensekey).getSynset().getGloss();
> > >> > >
> > >> > > Hope this is what you are looking for, otherwise please clarify.
> > >> > >
> > >> > > Anthony Beylerian
> > >> > >
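
To work with the result strings shown above, one option is to split each entry into its three fields and pick the highest score. The following is a minimal sketch that assumes every entry has exactly the "Source SenseKey Score" shape from the example and that a higher score is better; the class WsdResultSketch and its methods are hypothetical helpers, not part of the WSD component.

```java
/** Sketch: parse "Source SenseKey Score" strings and select the top-scoring sense. */
final class WsdResultSketch {
    final String source;
    final String senseKey;
    final double score;

    WsdResultSketch(String source, String senseKey, double score) {
        this.source = source;
        this.senseKey = senseKey;
        this.score = score;
    }

    /** Parses one entry such as "WORDNET bank%1:14:00:: 21.6". */
    static WsdResultSketch parse(String entry) {
        String[] parts = entry.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + entry);
        }
        return new WsdResultSketch(parts[0], parts[1], Double.parseDouble(parts[2]));
    }

    /** Returns the entry with the highest score (assumed here to mean the best sense). */
    static WsdResultSketch best(String[] entries) {
        if (entries.length == 0) {
            throw new IllegalArgumentException("No results");
        }
        WsdResultSketch top = parse(entries[0]);
        for (int i = 1; i < entries.length; i++) {
            WsdResultSketch candidate = parse(entries[i]);
            if (candidate.score > top.score) {
                top = candidate;
            }
        }
        return top;
    }

    public static void main(String[] args) {
        String[] results = {"WORDNET bank%1:14:00:: 21.6", "WORDNET bank%1:21:01:: 5.8"};
        WsdResultSketch top = best(results);
        System.out.println(top.source + " " + top.senseKey + " " + top.score);
    }
}
```

The sense key of the selected entry is what would then be passed to the dictionary lookup quoted above to obtain the gloss.
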
> > >> > > On Tue, Sep 8, 2015 at 5:34 PM, Cristian Petroaca <
> > >> > > cristian.petroaca@gmail.com> wrote:
> > >> > >
> > >> > > > Hi Anthony,
> > >> > > >
> > >> > > > I had a chance to test the WSD component. That's great work. Thanks.
> > >> > > > One question: is it possible to return the WordNet type (or database id)
> > >> > > > of the disambiguated word?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Cristian
> > >> > > >
> > >> > > > On Fri, Jul 24, 2015 at 1:14 PM, Anthony Beylerian <
> > >> > > > anthonybeylerian@hotmail.com> wrote:
> > >> > > >
> > >> > > > > Hi,
> > >> > > > >
> > >> > > > > To try out the ongoing implementations, after checking out the sandbox
> > >> > > > > repository please try these steps:
> > >> > > > > 1- Create a resource models directory:
> > >> > > > >
> > >> > > > > - src
> > >> > > > >   - test
> > >> > > > >     - resources
> > >> > > > >       + models
> > >> > > > >
> > >> > > > > 2- Include the following pre-trained models and dictionary in that
> > >> > > > > directory. You can find those here [1] if you like, or pre-train your own
> > >> > > > > models.
> > >> > > > >
> > >> > > > > {
> > >> > > > > en-token.bin,
> > >> > > > > en-pos-maxent.bin,
> > >> > > > > en-sent.bin,
> > >> > > > > en-ner-person.bin,
> > >> > > > > en-lemmatizer.dict
> > >> > > > > }
> > >> > > > >
> > >> > > > > To train the IMS approach you also need to include training data such as
> > >> > > > > senseval3 [2].
> > >> > > > > For now, please add these folders :
> > >> > > > > - src
> > >> > > > >   - test
> > >> > > > >     - resources
> > >> > > > >        - supervised
> > >> > > > >          + raw
> > >> > > > >          + models
> > >> > > > >          + dictionary
> > >> > > > >
> > >> > > > > You can find the data files here [2].
> > >> > > > >
> > >> > > > > 3- We included two examples, [LeskTester.java] and [IMSTester.java], that
> > >> > > > > you can run directly, or you can make your own tests.
> > >> > > > >
> > >> > > > > To run a custom test, you minimally need a tokenized text or sentence.
> > >> > > > > For example, for Lesk:
> > >> > > > >
> > >> > > > >           String[] words = Loader.getTokenizer().tokenize(sentence);
> > >> > > > >
> > >> > > > > Choose the index of the word to disambiguate in the token array:
> > >> > > > >
> > >> > > > >           int wordIndex = 6;
> > >> > > > >
> > >> > > > > Then just create a WSDisambiguator object, for example for Lesk:
> > >> > > > >
> > >> > > > >           Lesk lesk = new Lesk();
> > >> > > > >
> > >> > > > > And you can call the default disambiguation method:
> > >> > > > >
> > >> > > > >           lesk.disambiguate(words, wordIndex);
> > >> > > > >
> > >> > > > > You will get an array of strings with the following format :
> > >> > > > >
> > >> > > > > Lesk : [Source SenseKey Score]
> > >> > > > >
> > >> > > > > To read the sense definitions you can use the method :
> > >> > > > > [opennlp.tools.disambiguator.Constants.printResults]
> > >> > > > >
> > >> > > > > To use the variations of Lesk, you will need to create and configure a
> > >> > > > > parameters object:
> > >> > > > >
> > >> > > > >           LeskParameters leskParams = new LeskParameters();
> > >> > > > >           leskParams.setLeskType(LeskParameters.LESK_TYPE.LESK_BASIC_CTXT_WIN_BF);
> > >> > > > >           leskParams.setWin_b_size(4);
> > >> > > > >           leskParams.setDepth(3);
> > >> > > > >           lesk.setParams(leskParams);
> > >> > > > >
> > >> > > > > Typically, IMS should perform better than Lesk: Lesk is a classic
> > >> > > > > method, but it is usually used as a baseline along with the most frequent
> > >> > > > > sense (MFS).
> > >> > > > > However, we will be testing and adding more techniques.
> > >> > > > >
> > >> > > > > In any case, please feel free to ask for more details.
> > >> > > > >
> > >> > > > > Best,
> > >> > > > >
> > >> > > > > Anthony
> > >> > > > >
> > >> > > > > [1] : https://drive.google.com/folderview?id=0B67Iu3pf6WucfjdYNGhDc3hkTXd1a3FORnNUYzd3dV9YeWlyMFczeHU0SE1TcWwyU1lhZFU&usp=sharing
> > >> > > > > [2] : https://drive.google.com/file/d/0ByL0dmKXzHVfSXA3SVZiMnVfOGc/view?usp=sharing
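
For readers unfamiliar with why Lesk serves as a baseline, the idea behind its simplified form can be sketched in a few lines: score each candidate sense by how many words its gloss shares with the sentence context, and keep the best. The sketch below is not the sandbox implementation. It uses no WordNet lookup, no stop-word filtering and no context window, and the sense labels and glosses in the example are hypothetical placeholders.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch of simplified Lesk: choose the sense whose gloss overlaps most with the context. */
final class SimplifiedLeskSketch {

    /** Lower-cases the text and splits it into a set of alphabetic tokens. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /**
     * Returns the sense label with the largest gloss/context word overlap.
     * Ties go to the first sense in iteration order; null if there are no senses.
     */
    static String disambiguate(String[] words, int wordIndex, Map<String, String> glosses) {
        Set<String> context = new HashSet<>();
        for (int i = 0; i < words.length; i++) {
            if (i != wordIndex) {
                context.addAll(tokens(words[i]));
            }
        }
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, String> sense : glosses.entrySet()) {
            Set<String> shared = tokens(sense.getValue());
            shared.retainAll(context);
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                best = sense.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"I", "went", "to", "the", "bank", "to", "deposit", "money", "."};
        // Hypothetical sense labels and glosses, for illustration only.
        Map<String, String> glosses = new LinkedHashMap<>();
        glosses.put("bank#river", "sloping land beside a body of water");
        glosses.put("bank#financial", "a financial institution that accepts deposits and lends money");
        System.out.println(disambiguate(words, 4, glosses));
    }
}
```

Exact word matching is what keeps this approach weak: "deposit" and "deposits" do not match here, which is one reason a supervised method such as IMS is expected to do better.
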
> > >> > > > > > Date: Fri, 24 Jul 2015 09:54:02 +0200
> > >> > > > > > Subject: Re: Word Sense Disambiguator
> > >> > > > > > From: kottmann@gmail.com
> > >> > > > > > To: dev@opennlp.apache.org
> > >> > > > > >
> > >> > > > > > It would be nice if you could share instructions on how to run it.
> > >> > > > > > I also would like to give it a try.
> > >> > > > > >
> > >> > > > > > Jörn
> > >> > > > > >
> > >> > > > > > On Fri, Jul 24, 2015 at 4:54 AM, Anthony Beylerian <
> > >> > > > > > anthonybeylerian@hotmail.com> wrote:
> > >> > > > > >
> > >> > > > > > > Hello,
> > >> > > > > > > Yes, for the moment we are only using WordNet for sense definitions.
> > >> > > > > > > The plan is to complete the package by mid to late August, but if you
> > >> > > > > > > like you can follow the progress in the sandbox.
> > >> > > > > > > Best regards,
> > >> > > > > > > Anthony
> > >> > > > > > > > Date: Thu, 23 Jul 2015 15:36:57 +0300
> > >> > > > > > > > Subject: Word Sense Disambiguator
> > >> > > > > > > > From: cristian.petroaca@gmail.com
> > >> > > > > > > > To: dev@opennlp.apache.org
> > >> > > > > > > >
> > >> > > > > > > > Hi,
> > >> > > > > > > >
> > >> > > > > > > > I saw that there are people actively working on a Word Sense
> > >> > > > > > > > Disambiguator. Do you guys know when the module will be ready to use?
> > >> > > > > > > > Also, I assume that WordNet is used to define the disambiguated word's
> > >> > > > > > > > meaning?
> > >> > > > > > > >
> > >> > > > > > > > Thanks,
> > >> > > > > > > > Cristian

Re: Word Sense Disambiguator

Posted by Cristian Petroaca <cr...@gmail.com>.

Hi Anthony,

Thanks. I'd also be happy to help with whatever I can in order to bring
this component to trunk as soon as possible.

On Mon, Nov 2, 2015 at 2:10 PM, Anthony Beylerian <
anthonybeylerian@hotmail.com> wrote:

> Hello Cristian,
>
> Sorry for the late reply, I finally have a copy of a good corpus for
> coarse testing (OntoNotes).
> I will start working again on the component sometime this week.
>
> Best,
>
> Anthony
>

RE: Word Sense Disambiguator

Posted by Anthony Beylerian <an...@hotmail.com>.

Hello Cristian,

Sorry for the late reply, I finally have a copy of a good corpus for coarse testing (OntoNotes).
I will start working again on the component sometime this week.

Best,

Anthony 
