Posted to solr-user@lucene.apache.org by "Chaushu, Shani" <sh...@intel.com> on 2015/10/29 08:25:13 UTC

language plugin

Hi,
I'm using the Solr language detection plugin on a field named "content" (Solr 4.10, plugin LangDetectLanguageIdentifierUpdateProcessorFactory).
When I index a document for the first time it works fine, but if I set one field again (regardless of whether it's the content field or not), the language goes back to its default. When I set another field, I would like the language to stay the way it was before, and I don't want to insert all the content again. Is there an option to configure the plugin so that it won't recalculate the language? (Setting langid.overwrite to false didn't work.)

Thanks,
Shani
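
For readers following along: "setting one field again" is what Solr calls an atomic update. A minimal sketch of such a request in Solr's XML update format is shown below. The document id "doc1", the field name "category_s" and its value are hypothetical placeholders, not taken from the original setup.

```xml
<!-- Sketch of an atomic update: only the listed field is changed. -->
<!-- "doc1", "category_s" and "news" are illustrative assumptions. -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="category_s" update="set">news</field>
  </doc>
</add>
```

Note that a request like this carries only the id and the field being set, so a processor that runs before the update is expanded into the full document would presumably not see the "content" field at all.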


---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Re: OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by simon <mt...@gmail.com>.
https://github.com/OpenSextant/SolrTextTagger/

We're using it for country tagging successfully.


Re: OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by Alessandro Benedetti <ab...@apache.org>.
Apparently this mail thread is duplicated; anyway, I will copy and paste my
previous comment here as well:

Hi Christian.
This has been quite easy to set up since 2011.
You can make it as complex, or customise it as much, as you want.

Take a look:

https://cwiki.apache.org/confluence/display/solr/UIMA+Integration

https://wiki.apache.org/solr/SolrUIMA

This is a good, painless starting point.

Then you can make the scenario as complex as you want by developing your own
UpdateProcessor.
This is a simple customisation, and you can choose the best location NER
available (among the open source ones, I would suggest exploring
http://nlp.stanford.edu/software/corenlp.shtml, for example).

Apache OpenNLP could be a good choice as well.

Let us know if this is what you wanted.

Cheers





-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
David Smiley had a place name and general tagging engine that for the life
of me I can't find.

It didn't do NER for you (I'm not sure you want to do this in the search
engine) but it helps you tag entities in a search engine based on a
predefined list. At least that's what I remember.





-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.

OpenNLP plugin or similar NER software for Solr ??? !!!

Posted by li...@yahoo.com.INVALID.
Hi everyone, 

I need to install a plugin to extract locations (Country/State/City) from free-text documents. Any professional advice? Does OpenNLP really do the job? Is it English only? US only? Or does it cover place names worldwide?
Could someone help me with this job (installation, configuration, model training, etc.)?

Please help.
Kind regards,
Christian
 Christian Fotache Tel: 0728.297.207 Fax: 0351.411.570
 

     From: Upayavira <uv...@odoko.co.uk>
 To: solr-user@lucene.apache.org 
 Sent: Tuesday, November 3, 2015 12:13 PM
 Subject: Re: language plugin
   
Looking at the code, this is not going to work without modifications to
Solr (or at least a custom component).

The atomic update code is closely embedded into the Solr
DistributedUpdateProcessor, which expands the atomic update into a full
document and then posts it to the shards.

You need to do the update expansion before your lang detect processor,
but there is no gap between them.

From my reading of the code, you could create an AtomicUpdateProcessor
that simply expands updates, and insert that before the
LangDetectUpdateProcessor.

Upayavira

On Tue, Nov 3, 2015, at 06:38 AM, Chaushu, Shani wrote:
> Hi
> When I make an atomic update (a "set" on a field), whether on the content
> field or on another field, the language field becomes generic. In other
> words, detection doesn't work on a set, only on the first insert. Even if
> the language was detected the first time, it becomes generic after the update.
> Any ideas?
> 
> The chain is
> 
> <updateRequestProcessorChain name="aa_chain">
>   <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
>     <str name="langid.fl">title,content,text</str>
>     <str name="langid.langField">language_t</str>
>     <str name="langid.langsField">language_all_t</str>
>     <str name="langid.fallback">generic</str>
>     <str name="langid.overwrite">false</str>
>     <str name="langid.threshold">0.8</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> 
> Thanks,
> Shani
> 
> 
> 
> 
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupansky@gmail.com] 
> Sent: Thursday, October 29, 2015 17:04
> To: solr-user@lucene.apache.org
> Subject: Re: language plugin
> 
> Are you trying to do an atomic update without the content field? If so,
> it sounds like Solr needs an enhancement (bug fix?) so that language
> detection would be skipped if the input field is not present. Or maybe
> that could be an option.
> 
> 
> -- Jack Krupansky
> 

   

  



Re: language plugin

Posted by Upayavira <uv...@odoko.co.uk>.
Actually, you are right. It would be executed on every node if you put
LangDetect after a deliberately inserted
DistributedUpdateProcessorFactory entry.

Not optimal, but would work.

Upayavira

On Tue, Nov 3, 2015, at 12:26 PM, Alexandre Rafalovitch wrote:
> I wonder what would happen if the DistributedUpdateProcessorFactory is
> manually added into the chain and the LangDetect definition is moved
> AFTER it. As per
> https://wiki.apache.org/solr/UpdateRequestProcessor#Distributed_Updates
> 
> This would mean that the detection code would be executed on each
> node, but with the record expanded to include those other fields
> (assuming they were stored). This may do the trick, though a custom
> URP would probably be a better solution anyway.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
> 
> 

Re: language plugin

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I wonder what would happen if the DistributedUpdateProcessorFactory is
manually added into the chain and the LangDetect definition is moved
AFTER it. As per
https://wiki.apache.org/solr/UpdateRequestProcessor#Distributed_Updates

This would mean that the detection code would be executed on each
node, but with the record expanded to include those other fields
(assuming they were stored). This may do the trick, though a custom
URP would probably be a better solution anyway.
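
A minimal sketch of that reordering, assuming the aa_chain definition posted earlier in this thread. The only change is the explicit DistributedUpdateProcessorFactory entry placed ahead of the language detection processor; this arrangement is untested here.

```xml
<updateRequestProcessorChain name="aa_chain">
  <!-- Explicitly placed so that the processors below it run after the distributed update step -->
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,content,text</str>
    <str name="langid.langField">language_t</str>
    <str name="langid.langsField">language_all_t</str>
    <str name="langid.fallback">generic</str>
    <str name="langid.overwrite">false</str>
    <str name="langid.threshold">0.8</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

This relies on title, content and text being stored fields, since the expanded record can only include values that Solr is able to read back from the index.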
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 November 2015 at 05:13, Upayavira <uv...@odoko.co.uk> wrote:
> Looking at the code, this is not going to work without modifications to
> Solr (or at least a custom component).
>
> The atomic update code is closely embedded into the Solr
> DistributedUpdateProcessor, which expands the atomic update into a full
> document and then posts it to the shards.
>
> You need to do the update expansion before your lang detect processor,
> but there is no gap between them.
>
> From my reading of the code, you could create an AtomicUpdateProcessor
> that simply expands updates, and insert that before the
> LangDetectUpdateProcessor.
>
> Upayavira
>

Re: language plugin

Posted by Upayavira <uv...@odoko.co.uk>.
Looking at the code, this is not going to work without modifications to
Solr (or at least a custom component).

The atomic update code is closely embedded into the Solr
DistributedUpdateProcessor, which expands the atomic update into a full
document and then posts it to the shards.

You need to do the update expansion before your lang detect processor,
but there is no gap between them.

From my reading of the code, you could create an AtomicUpdateProcessor
that simply expands updates, and insert that before the
LangDetectUpdateProcessor.
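
As a rough sketch, the chain might then be wired up as below. The class name com.example.AtomicExpandUpdateProcessorFactory is purely hypothetical: no such processor ships with Solr 4.10, and it would have to be written and deployed as a custom component.

```xml
<updateRequestProcessorChain name="aa_chain">
  <!-- Hypothetical custom processor: expands an atomic update into a full document -->
  <processor class="com.example.AtomicExpandUpdateProcessorFactory" />
  <!-- Language detection placed after the expansion step -->
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,content,text</str>
    <str name="langid.langField">language_t</str>
    <str name="langid.langsField">language_all_t</str>
    <str name="langid.fallback">generic</str>
    <str name="langid.overwrite">false</str>
    <str name="langid.threshold">0.8</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```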

Upayavira

On Tue, Nov 3, 2015, at 06:38 AM, Chaushu, Shani wrote:
> Hi
> When I make an atomic update (a "set" on a field), whether on the content
> field or on another field, the language field becomes generic. In other
> words, detection works only on the first insert, not on a later set. Even if
> the language was detected the first time, it becomes generic after the update.
> Any idea?
> 
> The chain is
> 
> <updateRequestProcessorChain name="aa_chain">
> <processor
> class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory"> 
> <str name="langid.fl">title,content,text</str>
>     <str name="langid.langField">language_t</str>
>     <str name="langid.langsField">language_all_t</str>
>     <str name="langid.fallback">generic</str>
>     <str name="langid.overwrite">false</str> 
>     <str name="langid.threshold">0.8</str>
> </processor>
> <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> 
> Thanks,
> Shani
> 
> 
> 
> 

RE: language plugin

Posted by "Chaushu, Shani" <sh...@intel.com>.
Hi
When I make an atomic update (a "set" on a field), whether on the content field or on another field, the language field becomes generic. In other words, detection works only on the first insert, not on a later set. Even if the language was detected the first time, it becomes generic after the update.
Any idea?
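
For illustration, an atomic "set" update of this kind might look roughly like the following in Solr's XML update format. The unique key field name "id", the document id "doc1" and the new title value are assumptions made up for the example.

```xml
<add>
  <doc>
    <!-- Assumed unique key field and a made-up document id -->
    <field name="id">doc1</field>
    <!-- update="set" replaces only this field; content is not sent again -->
    <field name="title" update="set">A new title</field>
  </doc>
</add>
```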

The chain is

<updateRequestProcessorChain name="aa_chain">
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">      
<str name="langid.fl">title,content,text</str>
    <str name="langid.langField">language_t</str>
    <str name="langid.langsField">language_all_t</str>
    <str name="langid.fallback">generic</str>
    <str name="langid.overwrite">false</str> 
    <str name="langid.threshold">0.8</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


Thanks,
Shani




-----Original Message-----
From: Jack Krupansky [mailto:jack.krupansky@gmail.com] 
Sent: Thursday, October 29, 2015 17:04
To: solr-user@lucene.apache.org
Subject: Re: language plugin

Are you trying to do an atomic update without the content field? If so, it sounds like Solr needs an enhancement (bug fix?) so that language detection would be skipped if the input field is not present. Or maybe that could be an option.


-- Jack Krupansky


Re: language plugin

Posted by Jack Krupansky <ja...@gmail.com>.
Are you trying to do an atomic update without the content field? If so, it
sounds like Solr needs an enhancement (bug fix?) so that language detection
would be skipped if the input field is not present. Or maybe that could be
an option.
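
Until something like that exists, one possible workaround (an assumption, not something tested here) is a second chain without the language detection processor, used only for atomic updates that do not touch the content fields. The chain name aa_chain_no_langid is made up for this sketch.

```xml
<!-- Hypothetical second chain: same as aa_chain, but without language detection -->
<updateRequestProcessorChain name="aa_chain_no_langid">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain would be selected per request with update.chain=aa_chain_no_langid. This only keeps the previously detected value if language_t and language_all_t are stored fields, and it does not re-detect the language when the content itself changes.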


-- Jack Krupansky

On Thu, Oct 29, 2015 at 3:25 AM, Chaushu, Shani <sh...@intel.com>
wrote:

> Hi,
>  I'm using the Solr language detection plugin on a field named "content" (Solr
> 4.10, plugin LangDetectLanguageIdentifierUpdateProcessorFactory).
> When I index the first time it works fine, but if I want to set
> one field again (regardless of whether it's the content or not), it goes to its
> default language. If I'm setting another field I would like the language to
> stay the way it was before, and I don't want to insert all the content
> again. Is there an option to set the plugin so that it won't calculate
> the language again? (Setting langid.overwrite to false didn't work.)
>
> Thanks,
> Shani
>
>

Re: language plugin

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Could you post your full chain definition? It's an interesting
problem, but hard to answer without seeing the exact current
configuration.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 29 October 2015 at 03:25, Chaushu, Shani <sh...@intel.com> wrote:
> Hi,
>  I'm using the Solr language detection plugin on a field named "content" (Solr 4.10, plugin LangDetectLanguageIdentifierUpdateProcessorFactory).
> When I index the first time it works fine, but if I want to set one field again (regardless of whether it's the content or not), it goes to its default language. If I'm setting another field I would like the language to stay the way it was before, and I don't want to insert all the content again. Is there an option to set the plugin so that it won't calculate the language again? (Setting langid.overwrite to false didn't work.)
>
> Thanks,
> Shani