Posted to users@opennlp.apache.org by Lance Norskog <go...@gmail.com> on 2012/05/31 02:54:18 UTC

Patch for Lucene/Solr

I'm creating a patch to integrate OpenNLP into the Lucene/Solr
project. The SentenceDetector, Tokenizer, POS tagger, Chunker, and NER
tools are included. The SentenceDetector and Tokenizer are wrapped as a
Lucene Tokenizer, and a Lucene TokenFilter takes this stream and runs
POS/Chunking/NER on it, saving the tags as upper-case payloads. The
patch includes a couple of handy combinations. For example, you can make
a more focused search index by indexing only the nouns and verbs.
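
To make the nouns-and-verbs idea concrete, here is a rough, library-free
sketch in plain Java. It is not code from the patch and it does not use the
Lucene or OpenNLP APIs: NounVerbFilterSketch and TaggedToken are
hypothetical names, and it assumes Penn Treebank style tags (NN* for nouns,
VB* for verbs), which may not match every model.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a stand-in for "tokenizer output + POS payload + filter".
final class NounVerbFilterSketch {

    /** Hypothetical stand-in for a token whose payload holds its POS tag. */
    static final class TaggedToken {
        final String term;
        final byte[] payload; // upper-case tag bytes, e.g. "NN" or "VBZ"

        TaggedToken(String term, String posTag) {
            this.term = term;
            this.payload = posTag.toUpperCase(Locale.ROOT)
                                 .getBytes(StandardCharsets.UTF_8);
        }

        String tag() {
            return new String(payload, StandardCharsets.UTF_8);
        }
    }

    /** Keeps only terms tagged as nouns (NN*) or verbs (VB*); assumed tag set. */
    static List<String> keepNounsAndVerbs(List<TaggedToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            String tag = token.tag();
            if (tag.startsWith("NN") || tag.startsWith("VB")) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TaggedToken> sentence = List.of(
            new TaggedToken("The", "dt"),
            new TaggedToken("parser", "nn"),
            new TaggedToken("tags", "vbz"),
            new TaggedToken("every", "dt"),
            new TaggedToken("word", "nn"));
        System.out.println(keepNounsAndVerbs(sentence)); // [parser, tags, word]
    }
}
```

In a real analysis chain the filtering step would sit in a TokenFilter after
the tagging filter, so only the surviving terms reach the index.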

Do you have any hints on how to package it? The documentation should
include how to download and install the models.

-- 
Lance Norskog
goksron@gmail.com

Re: Patch for Lucene/Solr

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Lance,

There is already a Jira issue for that, and you can attach your patch there [1].
I thought about this some time ago; it could be done by using the OpenNLP UIMA
integration on top of the lucene-analysis-uima tokenizers [2].
To filter out some PoS-tagged tokens, one could use
the UIMATypeAwareAnnotationsTokenizer [3] with the TypeTokenFilter [4].
That would use only existing Lucene pieces; however, a plain
integration would also be good, to avoid unnecessary layers when they are not needed.
My 2 cents,
Tommaso


[1] : https://issues.apache.org/jira/browse/LUCENE-2899
[2] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/uima/
[3] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/uima/src/java/org/apache/lucene/analysis/uima/UIMATypeAwareAnnotationsTokenizer
[4] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/TypeTokenFilter.java


Re: Patch for Lucene/Solr

Posted by James Kosin <ja...@gmail.com>.
I use the sources from SourceForge for the dependency.
    http://sourceforge.net/projects/jwordnet/


On 6/16/2012 9:20 AM, Jörn Kottmann wrote:
> A pragmatic solution for you could be to just exclude this
> dependency. It is only needed for coref.
>
> Yes, we should consider switching!
>
> Jörn



Re: Patch for Lucene/Solr

Posted by Jörn Kottmann <ko...@gmail.com>.
A pragmatic solution for you could be to just exclude this
dependency. It is only needed for coref.

Yes, we should consider switching!

Jörn

On 06/16/2012 02:15 PM, Aliaksandr Autayeu wrote:
> extjwnl ( http://extjwnl.sourceforge.net/ ) is in Maven and it is almost
> 100% compatible. The package names have changed, though.
>
> Aliaksandr


Re: Patch for Lucene/Solr

Posted by Lance Norskog <go...@gmail.com>.
Yes, this is for the trunk and the 4.x branch.

Run this at the top level:
patch -p0 < LUCENE-2899.patch

Then follow the directions on the Solr wiki. This takes two steps to
compile. (When it is committed the build will be normal.)
http://wiki.apache.org/solr/OpenNLP

This is a Solr project, so ask future questions on solr-user@lucene.apache.org.

Lance (the author)

On Wed, Jul 11, 2012 at 3:07 PM, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:
> Hi Sam,
>
> I think you should apply the patch to the trunk version of Lucene/Solr
> instead of 3.6.
>
> koji
> --
> http://soleami.com/blog/starting-lab-work.html
>
>
>
> (12/07/12 0:57), sam wu wrote:
>>
>> Hi, I am trying to follow the instructions and test NLP*Filter.
>>
>> I downloaded Lucene/Solr 3.6, with all the necessary OpenNLP stuff
>> (library, model).
>> Then I tried to patch; JIRA 2899 has several patch files.
>> 1. Is the patch command -- "patch -p0 ???.patch" -- correct?
>> 2. Do I have to apply all the patches?
>>
>>
>> Thanks
>>
>> Sam



-- 
Lance Norskog
goksron@gmail.com

Re: Patch for Lucene/Solr

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Sam,

I think you should apply the patch to the trunk version of Lucene/Solr
instead of 3.6.

koji
-- 
http://soleami.com/blog/starting-lab-work.html


(12/07/12 0:57), sam wu wrote:
> Hi, I am trying to follow the instructions and test NLP*Filter.
>
> I downloaded Lucene/Solr 3.6, with all the necessary OpenNLP stuff
> (library, model).
> Then I tried to patch; JIRA 2899 has several patch files.
> 1. Is the patch command -- "patch -p0 ???.patch" -- correct?
> 2. Do I have to apply all the patches?
>
>
> Thanks
>
> Sam




Re: Patch for Lucene/Solr

Posted by sam wu <sw...@gmail.com>.
Hi, I am trying to follow the instructions and test NLP*Filter.

I downloaded Lucene/Solr 3.6, with all the necessary OpenNLP stuff
(library, model).
Then I tried to patch; JIRA 2899 has several patch files.
1. Is the patch command -- "patch -p0 ???.patch" -- correct?
2. Do I have to apply all the patches?


Thanks

Sam

On Wed, Jul 4, 2012 at 6:11 PM, Lance Norskog <go...@gmail.com> wrote:

> The Solr wiki is updated, including directions for testing the patch:
> http://wiki.apache.org/solr/OpenNLP

Re: Patch for Lucene/Solr

Posted by Lance Norskog <go...@gmail.com>.
The Solr wiki is updated, including directions for testing the patch:
http://wiki.apache.org/solr/OpenNLP

On Wed, Jul 4, 2012 at 4:54 PM, Lance Norskog <go...@gmail.com> wrote:
> Hello-
>
> A committable patch is up! The Lucene classes are in
> lucene/analysis/opennlp and the Solr classes are in
> solr/contrib/opennlp. Several bits of build script fu are in the
> appropriate places.
>
> It uses 'jwnl', with version 1.4rc3. Yes, this is not what OpenNLP is
> compiled against, but the build works and co-reference is not used in
> this patch.
>
> The SentenceDetector, Tokenizer and POS/Chunking/NER are tested with
> miniaturized models made from miniaturized test corpuses. They are
> kawaii.
>
> http://wiki.apache.org/solr/OpenNLP



-- 
Lance Norskog
goksron@gmail.com

Re: Patch for Lucene/Solr

Posted by Aliaksandr Autayeu <al...@autayeu.com>.
extjwnl ( http://extjwnl.sourceforge.net/ ) is in Maven and it is almost
100% compatible. The package names have changed, though.

Aliaksandr

On Fri, Jun 15, 2012 at 11:31 PM, Lance Norskog <go...@gmail.com> wrote:

> Turns out I can just put text files in the patch and it will accept
> the libraries via 'ivy'. The only remaining problem is that 'jwnl' is
> not in Maven.

Re: Patch for Lucene/Solr

Posted by Lance Norskog <go...@gmail.com>.
Turns out I can just put text files in the patch and it will accept
the libraries via 'ivy'. The only remaining problem is that 'jwnl' is
not in Maven.

On Fri, Jun 15, 2012 at 3:35 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 06/06/2012 10:53 AM, Lance Norskog wrote:
>>
>> The opennlp build needs a little upgrading to work with the license
>> validation in the Lucene build. OPENNLP-511 requests this.
>
>
> I will have a look at it for the next release. Planning to start working
> on it soon.
>
> Jörn



-- 
Lance Norskog
goksron@gmail.com

Re: Patch for Lucene/Solr

Posted by Jörn Kottmann <ko...@gmail.com>.
On 06/06/2012 10:53 AM, Lance Norskog wrote:
> The opennlp build needs a little upgrading to work with the license
> validation in the Lucene build. OPENNLP-511 requests this.

I will have a look at it for the next release. Planning to start working
on it soon.

Jörn

Re: Patch for Lucene/Solr

Posted by Lance Norskog <go...@gmail.com>.
It's up!

https://issues.apache.org/jira/browse/LUCENE-2899

It has sentence detection/tokenization, POS tagging, chunking and NER. Also
some utility filters to fiddle with payloads. It is smart about caching models.
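
For readers wondering what "caching models" can look like, the following is
a minimal sketch in plain Java, not the code in the patch. It assumes models
are keyed by file path and loaded through a caller-supplied function;
ModelCacheSketch and the loader are hypothetical names, and the file names
in main are only examples.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Illustration only: load each model once and share it between analyzers.
final class ModelCacheSketch<M> {
    private final Map<String, M> cache = new ConcurrentHashMap<>();
    private final Function<String, M> loader; // hypothetical: reads a model file

    ModelCacheSketch(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Returns the cached model for this path, loading it on first use only. */
    M get(String modelPath) {
        return cache.computeIfAbsent(modelPath, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        ModelCacheSketch<String> models = new ModelCacheSketch<>(path -> {
            loads.incrementAndGet(); // stands in for an expensive model read
            return "model:" + path;
        });
        models.get("en-pos-maxent.bin");
        models.get("en-pos-maxent.bin"); // served from the cache
        models.get("en-sent.bin");
        System.out.println(loads.get()); // 2
    }
}
```

The point of the design is that several fields or analyzer chains can refer
to the same model file without each paying the load cost again.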

It is done as a Lucene tokenizer/tokenfilter, which is a fairly limiting arena.

The opennlp build needs a little upgrading to work with the license
validation in the Lucene build. OPENNLP-511 requests this.

On Fri, Jun 1, 2012 at 4:18 AM, Svetoslav Marinov
<sv...@findwise.com> wrote:
> At Findwise we actively use a number of OpenNLP components with both Hydra
> and OpenPipeline when indexing with Solr.
>
> I look forward to seeing the result of the patch!
>
> Best,
> Svetoslav



-- 
Lance Norskog
goksron@gmail.com

Re: Patch for Lucene/Solr

Posted by Jörn Kottmann <ko...@gmail.com>.
On 05/31/2012 11:10 PM, Lance Norskog wrote:
> Thanks. I have looked at UIMA several times and it seemed very
> complex. It has a lot of features, is mature, has an Eclipse app
> builder, etc. I could not keep it all in my head at once. The
> Solr/Lucene document pipeline features give little space for NLP
> features. Hydra or OpenPipeline give UIMA and OpenNLP "room to
> breathe".

There are many ways to solve this and UIMA is in my opinion
a good solution when you have more complex analysis needs
and/or a huge amount of data. But it also brings its own complexity.

It would definitely be useful to have OpenNLP support in Lucene
filters. +1 from me to implement that.
There will be a couple of use cases for it.

  Jörn

Re: Patch for Lucene/Solr

Posted by Svetoslav Marinov <sv...@findwise.com>.
At Findwise we actively use a number of OpenNLP components with both Hydra
and OpenPipeline when indexing with Solr.

I look forward to seeing the result of the patch!

Best,
Svetoslav

On 2012-05-31 23:10, "Lance Norskog" <go...@gmail.com> wrote:

>Thanks. I have looked at UIMA several times and it seemed very
>complex. It has a lot of features, is mature, has an Eclipse app
>builder, etc. I could not keep it all in my head at once. The
>Solr/Lucene document pipeline features give little space for NLP
>features. Hydra or OpenPipeline give UIMA and OpenNLP "room to
>breathe".
>
>Are there free annotated text databases for UIMA? OpenNLP does not use
>any with open licences. It has binary models made from copyrighted
>annotations and so they cannot be checked into Apache.



Re: Patch for Lucene/Solr

Posted by Lance Norskog <go...@gmail.com>.
Thanks. I have looked at UIMA several times and it seemed very
complex. It has a lot of features, is mature, has an Eclipse app
builder, etc. I could not keep it all in my head at once. The
Solr/Lucene document pipeline features give little space for NLP
features. Hydra or OpenPipeline give UIMA and OpenNLP "room to
breathe".

Are there free annotated text databases for UIMA? OpenNLP does not use
any with open licences. It has binary models made from copyrighted
annotations and so they cannot be checked into Apache.

On Wed, May 30, 2012 at 6:11 PM, Christian Moen <cm...@atilika.com> wrote:
> Hello Lance,
>
> This is very cool!  I'm looking forward to having a look at this.
>
>
> Christian Moen
> http://atilika.com



-- 
Lance Norskog
goksron@gmail.com

Re: Patch for Lucene/Solr

Posted by Christian Moen <cm...@atilika.com>.
Hello Lance,

This is very cool!  I'm looking forward to having a look at this.


Christian Moen
http://atilika.com
