Posted to solr-user@lucene.apache.org by li...@yahoo.com.INVALID on 2015/11/04 21:05:38 UTC

OpenNLP plugin or similar NER software for Solr ??? !!!

Hi everyone, 

I need to install a plugin to extract locations (country/state/city) from free-text documents. Any professional advice? Does OpenNLP really do the job? Is it English only? US only? Or does it cover place names worldwide?
Could someone help me with this job: installation, configuration, model training, etc.?

Please help.
Kind regards,
Christian
 Christian Fotache Tel: 0728.297.207 Fax: 0351.411.570
 

     From: Upayavira <uv...@odoko.co.uk>
 To: solr-user@lucene.apache.org 
 Sent: Tuesday, November 3, 2015 12:13 PM
 Subject: Re: language plugin
   
Looking at the code, this is not going to work without modifications to
Solr (or at least a custom component).

The atomic update code is closely embedded into the Solr
DistributedUpdateProcessor, which expands the atomic update into a full
document and then posts it to the shards.

You need to do the update expansion before your lang detect processor,
but there is no gap between them.

From my reading of the code, you could create an AtomicUpdateProcessor
that simply expands updates, and insert that before the
LangDetectUpdateProcessor.
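
The ordering problem described above can be illustrated with a toy model. This is not Solr code: the update format, the function names and the trivial detector are all assumptions made for illustration. It only shows why a detector that runs before the atomic update is expanded sees no text, and why expanding first would help.

```python
# Toy model of an update processor chain. This is NOT Solr code.
# Assumption: an atomic update arrives as {"id": ..., field: {"set": value}}.

FALLBACK = "generic"  # mirrors langid.fallback in the quoted chain


def detect_language(text):
    """Hypothetical stand-in for a real language detector."""
    if not text.strip():
        return FALLBACK
    return "en" if " the " in f" {text.lower()} " else FALLBACK


def lang_detect(doc, fields=("title", "content", "text")):
    """Detect the language from plain string values only."""
    parts = []
    for field in fields:
        value = doc.get(field)
        if isinstance(value, str):  # a {"set": ...} map is not plain text
            parts.append(value)
    result = dict(doc)
    result["language_t"] = detect_language(" ".join(parts))
    return result


def expand_atomic(update, stored):
    """Merge an atomic 'set' update into the stored document."""
    full = dict(stored)
    for field, value in update.items():
        if isinstance(value, dict) and "set" in value:
            full[field] = value["set"]
        else:
            full[field] = value
    return full


stored = {"id": "1", "title": "Intro",
          "content": "the cat sat on the mat", "language_t": "en"}
update = {"id": "1", "title": {"set": "New title"}}

# Detector first: it sees no plain text in the partial update and falls back.
naive = lang_detect(update)

# Expansion first: the detector sees the full document.
fixed = lang_detect(expand_atomic(update, stored))

print(naive["language_t"], fixed["language_t"])
```

In this sketch the only difference between the two results is the order of the two steps, which is the gap a custom expanding processor would have to fill.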

Upayavira

On Tue, Nov 3, 2015, at 06:38 AM, Chaushu, Shani wrote:
> Hi
> When I make an atomic update (set) on the content field, or on any other
> field, the language field becomes generic. In other words, detection does
> not work on a set update, only on the first insert. Even if the language
> was detected the first time, it becomes generic after the update.
> Any ideas?
> 
> The chain is
> 
> <updateRequestProcessorChain name="aa_chain">
>   <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
>     <str name="langid.fl">title,content,text</str>
>     <str name="langid.langField">language_t</str>
>     <str name="langid.langsField">language_all_t</str>
>     <str name="langid.fallback">generic</str>
>     <str name="langid.overwrite">false</str>
>     <str name="langid.threshold">0.8</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> 
> Thanks,
> Shani
> 
> 
> 
> 
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupansky@gmail.com] 
> Sent: Thursday, October 29, 2015 17:04
> To: solr-user@lucene.apache.org
> Subject: Re: language plugin
> 
> Are you trying to do an atomic update without the content field? If so,
> it sounds like Solr needs an enhancement (bug fix?) so that language
> detection would be skipped if the input field is not present. Or maybe
> that could be an option.
> 
> 
> -- Jack Krupansky
> 
> On Thu, Oct 29, 2015 at 3:25 AM, Chaushu, Shani <sh...@intel.com>
> wrote:
> 
> > Hi,
> > I'm using the Solr language detection plugin on a field named "content"
> > (Solr 4.10, plugin LangDetectLanguageIdentifierUpdateProcessorFactory).
> > When I index a document the first time it works fine, but if I set one
> > field again (regardless of whether it's the content field or not) the
> > language goes back to its default. If I'm setting another field, I would
> > like the language to stay the way it was before, and I don't want to
> > insert all the content again. Is there an option to tell the plugin not
> > to calculate the language again? (Setting langid.overwrite to false
> > didn't work.)
> >
> > Thanks,
> > Shani
> >
> >
> > ---------------------------------------------------------------------
> > Intel Electronics Ltd.
> >
> > This e-mail and any attachments may contain confidential material for 
> > the sole use of the intended recipient(s). Any review or distribution 
> > by others is strictly prohibited. If you are not the intended 
> > recipient, please contact the sender and delete all copies.
> >

   

  

Re: OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by simon <mt...@gmail.com>.
https://github.com/OpenSextant/SolrTextTagger/

We're using it for country tagging successfully.

On Wed, Nov 4, 2015 at 3:10 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> David Smiley had a place name and general tagging engine that for the life
> of me I can't find.
>
> It didn't do NER for you (I'm not sure you want to do this in the search
> engine) but it helps you tag entities in a search engine based on a
> predefined list. At least that's what I remember.

Re: OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by Alessandro Benedetti <ab...@apache.org>.
This mail thread appears to be duplicated, so I will copy and paste my
previous comment here as well:

Hi Christian.
This has been fairly easy to set up since 2011.
You can make it as complicated as you want,
or customise it as much as you want.

Take a look:

https://cwiki.apache.org/confluence/display/solr/UIMA+Integration

https://wiki.apache.org/solr/SolrUIMA

This is a good, painless starting point.

From there you can make the setup as sophisticated as you want by
developing your own updateProcessor.
This is a simple customisation, and you can choose to use the best location
NER available (among the open source ones, for example, I would suggest
exploring http://nlp.stanford.edu/software/corenlp.shtml).

Apache OpenNLP could be a good choice as well.
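
As a rough illustration of the custom update processor idea, here is a minimal sketch in Python. In Solr the real component would be written in Java against Solr's update processor API; everything below (the function names, the `_ss` field suffix, the hard-coded extractor) is a hypothetical stand-in that only shows the flow: run an interchangeable NER step over a text field and add the results to the document before it is indexed.

```python
from typing import Any, Callable, Dict, List, Tuple

# An extractor takes text and returns (entity_type, value) pairs.
# Any NER library could be wrapped to fit this shape.
Extractor = Callable[[str], List[Tuple[str, str]]]


def make_location_processor(extract: Extractor, source_field: str = "content"):
    """Build a processing step that adds location fields to a document."""

    def process(doc: Dict[str, Any]) -> Dict[str, Any]:
        enriched = dict(doc)
        for entity_type, value in extract(str(doc.get(source_field, ""))):
            field = f"{entity_type.lower()}_ss"  # hypothetical field naming
            values = list(enriched.get(field, []))
            if value not in values:
                values.append(value)
            enriched[field] = values
        return enriched

    return process


def demo_extractor(text: str) -> List[Tuple[str, str]]:
    """Stand-in for a real NER model; it matches two hard-coded names."""
    known = {"Bucharest": "CITY", "Romania": "COUNTRY"}
    return [(kind, name) for name, kind in known.items() if name in text]


processor = make_location_processor(demo_extractor)
doc = {"id": "42", "content": "Our office moved from Bucharest to Cluj, Romania."}
result = processor(doc)
print(result)
```

Keeping the extractor as a plain function argument means the NER engine can be swapped without touching the step that writes the fields.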

Let us know if this is what you wanted.

Cheers


On 4 November 2015 at 20:10, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> David Smiley had a place name and general tagging engine that for the life
> of me I can't find.
>
> It didn't do NER for you (I'm not sure you want to do this in the search
> engine) but it helps you tag entities in a search engine based on a
> predefined list. At least that's what I remember.



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
David Smiley had a place name and general tagging engine that for the life
of me I can't find.

It didn't do NER for you (I'm not sure you want to do this in the search
engine) but it helps you tag entities in a search engine based on a
predefined list. At least that's what I remember.
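
Tagging against a predefined list can be sketched as a longest-match dictionary lookup. This is only an illustration of the general idea, not how any particular tagger is implemented; the gazetteer entries and function name are made up for the example.

```python
import re

# Hypothetical predefined list: surface form -> (type, canonical name).
GAZETTEER = {
    "new york": ("City", "New York"),
    "new york city": ("City", "New York"),
    "romania": ("Country", "Romania"),
    "texas": ("State", "Texas"),
}
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def tag_places(text):
    """Return (start, end, type, canonical) for the longest matches in text."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    tags = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if phrase in GAZETTEER:
                kind, canonical = GAZETTEER[phrase]
                tags.append((tokens[i][1], tokens[i + n - 1][2], kind, canonical))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return tags


text = "Flights from New York City to Romania"
tags = tag_places(text)
for start, end, kind, canonical in tags:
    print(text[start:end], kind, canonical)
```

Unlike statistical NER, this approach only finds names that are already in the list, which is why coverage depends entirely on the quality of the gazetteer.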

On Wed, Nov 4, 2015 at 3:05 PM, <li...@yahoo.com.invalid> wrote:

> Hi everyone,
>
> I need to install a plugin to extract locations (country/state/city) from
> free-text documents. Any professional advice? Does OpenNLP really do
> the job? Is it English only? US only? Or does it cover place names
> worldwide?
> Could someone help me with this job: installation, configuration,
> model training, etc.?




-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.