
Posted to dev@opennlp.apache.org by Boris Galitsky <bg...@hotmail.com> on 2012/03/25 00:09:49 UTC

skype call?


Hi Jörn,

Maybe we can talk on Skype to discuss the progress of the Similarity component? I closed all the tickets we were discussing 2 months ago, and added a couple of interesting new features.

Regards
Boris

Re: minutes of the skype call on Similarity component

Posted by Jörn Kottmann <ko...@gmail.com>.

On 03/28/2012 11:02 PM, Aliaksandr Autayeu wrote:
> One small note on "b) improve caching. Now it is implemented via Java
> object serialization; make it via CSV files".
> If you use some library for CSV, you might as well think about Google
> Protocol Buffers. They are pretty fast.

The code reads test data from this file and this is only done during the
unit tests. Using object serialization doesn't really work because it
depends on the VM version.
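
For readers following the caching discussion, here is a minimal sketch of what a CSV-backed test cache could look like in plain Java, as an alternative to object serialization. The class name CsvCacheSketch, the record layout (a sentence plus a bracketed parse string), and the quoting scheme are assumptions made for illustration; none of them are taken from the Similarity code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one cached test record per CSV line. */
final class CsvCacheSketch {

  /** Wrap a field in quotes and double any embedded quotes, so commas and quotes survive. */
  static String escape(String field) {
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  /** Encode one record (for example: sentence, cached parse string) as a single CSV line. */
  static String toLine(List<String> fields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.size(); i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(escape(fields.get(i)));
    }
    return sb.toString();
  }

  /** Decode a line produced by toLine back into its fields. */
  static List<String> fromLine(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
          current.append('"'); // doubled quote inside a quoted field
          i++;
        } else if (c == '"') {
          inQuotes = false;    // closing quote
        } else {
          current.append(c);
        }
      } else if (c == '"') {
        inQuotes = true;       // opening quote
      } else if (c == ',') {
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString());
    return fields;
  }

  public static void main(String[] args) {
    List<String> record = Arrays.asList("I like it, a lot", "(S (NP I) (VP like it))");
    String line = toLine(record);
    System.out.println(line);
    System.out.println(fromLine(line).equals(record));
  }
}
```

Unlike a serialized object stream, such a file is plain text, so it can be read and edited by hand. The sketch does not handle fields that contain line breaks.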

The best solution would be to just read in Parse trees, but this is
currently not possible because the file contains a Parse and a shallow parse.
To fix that, the current alignment code would need to output a Parse object
again. Another advantage of having just a Parse object is that it makes the
interface to the similarity component simpler.

Anyway, everyone who is interested in the similarity component should have a
look at the documentation Boris created.

Any comments and suggestions are very welcome!

Jörn

Re: minutes of the skype call on Similarity component

Posted by Aliaksandr Autayeu <al...@autayeu.com>.

Hi Boris,

Thank you!

One small note on "b) improve caching. Now it is implemented via Java
object serialization; make it via CSV files".
If you use some library for CSV, you might as well think about Google
Protocol Buffers. They are pretty fast.

regards,
Aliaksandr


On Wed, Mar 28, 2012 at 10:42 PM, Boris Galitsky <bg...@hotmail.com> wrote:

> Hi guys,
>
> Per Aliaksandr's suggestion, below are the minutes of our conversation
> with Jörn about the Similarity component and other related issues.
>
> 1) Prepare Similarity for release from the sandbox:
>       a) Improve readme.txt; add: 'The entry point to the Similarity
>          component is
>
>          SentencePairMatchResult matchRes =
>          sm.assessRelevance(sentence1,sentence2);
>
>          where matchRes includes the similarity score (weighted number of
>          common terms) and the set of maximum common parse trees.'
>       b) Improve caching. It is currently implemented via Java object
>          serialization; switch to CSV files.
>       c) Proper location for cache files and resources:
>          joernkottmann: src/test/resources
>       d) Verify the Porter stemmer (remove Lucene dependencies, remove the
>          Porter stemmer from /similarity).
>       e) Re-format the code, using the Eclipse template:
>          joernkottmann: http://opennlp.apache.org/code-conventions.html
>       f) Package into a separate jar/src using Maven.
>
> 2) Next major feature of Similarity: taxonomy auto-learning and using the
>    taxonomy to improve search relevance:
>       a) See how the Similarity component can help with search tasks.
>       b) Integration with Solr (compare/complement github.com/tamingtext of
>          Grant Ingersoll with Similarity). There are some JIRA issues open
>          for hooking some of the tamingtext code into the analyzers modules
>          in Solr.
>
> 3) More examples and docs for the Similarity component:
>       a) Examples for finding similar news at allvoices.com:
>          email the code which generates search queries for news articles.
>       b) Email the link to the papers on
>          joernkottmann: https://cwiki.apache.org/OPENNLP/nlp-papers.html
>
> 4) Other future features/improvements for Similarity:
>       a) How can we create a more accurate Parse object by running the
>          chunker separately and then applying the alignment algorithm?
>       b) Coreference component:
>          joernkottmann: TreebankNameFinder
>       c) Apply machine learning to parse trees + coreferences.
>          "Parse forest": is it a good name?
>          joernkottmann: CorefSample.
>
> Regards
> Boris
>

fixed items for Similarity / RE: minutes of the skype call on Similarity component

Posted by Boris Galitsky <bg...@hotmail.com>.





Hi guys,

I want to indicate which of the items listed in my previous status email are now fixed:

1) Prepare Similarity for release from the sandbox:
      a) Improve readme.txt; add: 'The entry point to the Similarity component is

         SentencePairMatchResult matchRes =
         sm.assessRelevance(sentence1,sentence2);

         where matchRes includes the similarity score (weighted number of common terms) and the set of maximum common parse trees.'
         >>> Done
      b) Improve caching. It is currently implemented via Java object serialization; switch to CSV files.
         >>> Done
      c) Proper location for cache files and resources:
         joernkottmann: src/test/resources
         >>> Done
      d) Verify the Porter stemmer (remove Lucene dependencies, remove the Porter stemmer from /similarity).
         >>> That will be done outside of Similarity. Right now the downloadable opennlp-tools 1.5.2 does not have a Porter stemmer, so I temporarily keep it within Similarity.
      e) Re-format the code, using the Eclipse template:
         joernkottmann: http://opennlp.apache.org/code-conventions.html
         >>> Done
      f) Package into a separate jar/src using Maven.

2) Next major feature of Similarity: taxonomy auto-learning and using the taxonomy to improve search relevance:
      a) See how the Similarity component can help with search tasks.
         >>> Done.

3) More examples and docs for the Similarity component:
      a) Examples for finding similar news at allvoices.com:
         email the code which generates search queries for news articles.
         >>> Started, but not easy to integrate into Similarity because it is tightly connected with the original project.
      b) Email the link to the papers on
         joernkottmann: https://cwiki.apache.org/OPENNLP/nlp-papers.html
         >>> I extended the list with a new section on papers about similarity.

4) Other future features/improvements for Similarity:
   <<< These are FUTURE items
      a) How can we create a more accurate Parse object by running the chunker separately and then applying the alignment algorithm?
      b) Coreference component:
         joernkottmann: TreebankNameFinder
      c) Apply machine learning to parse trees + coreferences. "Parse forest": is it a good name?
         joernkottmann: CorefSample.

Regards
Boris

minutes of the skype call on Similarity component

Posted by Boris Galitsky <bg...@hotmail.com>.




Hi guys,

Per Aliaksandr's suggestion, below are the minutes of our conversation with Jörn about the Similarity component and other related issues.

1) Prepare Similarity for release from the sandbox:
      a) Improve readme.txt; add: 'The entry point to the Similarity component is

         SentencePairMatchResult matchRes =
         sm.assessRelevance(sentence1,sentence2);

         where matchRes includes the similarity score (weighted number of common terms) and the set of maximum common parse trees.'
      b) Improve caching. It is currently implemented via Java object serialization; switch to CSV files.
      c) Proper location for cache files and resources:
         joernkottmann: src/test/resources
      d) Verify the Porter stemmer (remove Lucene dependencies, remove the Porter stemmer from /similarity).
      e) Re-format the code, using the Eclipse template:
         joernkottmann: http://opennlp.apache.org/code-conventions.html
      f) Package into a separate jar/src using Maven.

2) Next major feature of Similarity: taxonomy auto-learning and using the taxonomy to improve search relevance:
      a) See how the Similarity component can help with search tasks.
      b) Integration with Solr (compare/complement github.com/tamingtext of Grant Ingersoll with Similarity). There are some JIRA issues open for hooking some of the tamingtext code into the analyzers modules in Solr.

3) More examples and docs for the Similarity component:
      a) Examples for finding similar news at allvoices.com:
         email the code which generates search queries for news articles.
      b) Email the link to the papers on
         joernkottmann: https://cwiki.apache.org/OPENNLP/nlp-papers.html

4) Other future features/improvements for Similarity:
      a) How can we create a more accurate Parse object by running the chunker separately and then applying the alignment algorithm?
      b) Coreference component:
         joernkottmann: TreebankNameFinder
      c) Apply machine learning to parse trees + coreferences. "Parse forest": is it a good name?
         joernkottmann: CorefSample.
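
To make "weighted number of common terms" from item 1a concrete, here is a toy sketch in Java. It is not the Similarity implementation: the real entry point works on parse trees via sm.assessRelevance, while the class CommonTermScoreSketch, its simple tokenization, and the per-term weight map are hypothetical stand-ins chosen only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: score two sentences by the summed weights of the terms they share. */
final class CommonTermScoreSketch {

  /** Assumed weight for any term that has no entry in the weight map. */
  static final double DEFAULT_WEIGHT = 1.0;

  /** Lower-case the sentence and split it on runs of non-word characters. */
  static Set<String> tokens(String sentence) {
    Set<String> out = new LinkedHashSet<>();
    for (String t : sentence.toLowerCase().split("\\W+")) {
      if (!t.isEmpty()) {
        out.add(t);
      }
    }
    return out;
  }

  /** Sum the weights of the distinct terms that occur in both sentences. */
  static double score(String s1, String s2, Map<String, Double> weights) {
    Set<String> common = tokens(s1);
    common.retainAll(tokens(s2));
    double total = 0.0;
    for (String term : common) {
      total += weights.getOrDefault(term, DEFAULT_WEIGHT);
    }
    return total;
  }

  public static void main(String[] args) {
    // Down-weight a function word; every other term counts as 1.0.
    Map<String, Double> weights = Collections.singletonMap("the", 0.1);
    System.out.println(score("The camera takes sharp pictures", "the camera is sharp", weights));
  }
}
```

With the weight map {the=0.1}, the sentences "The camera takes sharp pictures" and "the camera is sharp" share the terms "the", "camera" and "sharp", so the sketch scores them 0.1 + 1.0 + 1.0 = 2.1 (up to floating-point rounding).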
Regards
Boris

RE: skype call?

Posted by Boris Galitsky <bg...@hotmail.com>.

Thanks for the interest!

I will include the minutes in the relevant latest tickets.
Boris



> Date: Sun, 25 Mar 2012 23:53:29 +0200
> Subject: Re: skype call?
> From: aliaksandr@autayeu.com
> To: dev@opennlp.apache.org
> 
> It would be nice then to peek at the minutes... or maybe there is another
> way for others to follow your call?
> 
> Aliaksandr
> 
> On Sun, Mar 25, 2012 at 12:09 AM, Boris Galitsky <bg...@hotmail.com> wrote:
> 
> >
> >
> > Hi Jörn,
> >
> > Maybe we can talk on Skype to discuss the progress of the Similarity
> > component? I closed all the tickets we were discussing 2 months ago, and
> > added a couple of interesting new features.
> >
> > Regards
> > Boris
> >
> >
> >
> >

Re: skype call?

Posted by Aliaksandr Autayeu <al...@autayeu.com>.

It would be nice then to peek at the minutes... or maybe there is another
way for others to follow your call?

Aliaksandr

On Sun, Mar 25, 2012 at 12:09 AM, Boris Galitsky <bg...@hotmail.com> wrote:

>
>
> Hi Jörn,
>
> Maybe we can talk on Skype to discuss the progress of the Similarity
> component? I closed all the tickets we were discussing 2 months ago, and
> added a couple of interesting new features.
>
> Regards
> Boris
>
>
>
>