Posted to solr-user@lucene.apache.org by maephisto <my...@yahoo.com> on 2013/09/11 12:55:32 UTC

Dynamic analyzer settings change

Let's take the following type definition (borrowed from Rafal
Kuc's Solr 4 cookbook):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

and schema:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text" indexed="true" stored="true" />

The above analyzer applies the SnowballPorterFilter with the English
language setting. Would it be possible to change the language to French
during indexing for some documents? If not, what would be the best
solution for having the same analyzer but with different languages, with
the language being determined at index time?

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Dynamic analyzer settings change

Posted by Markus Jelsma <ma...@openindex.io>.


 
 
-----Original message-----
> From: maephisto <my...@yahoo.com>
> Sent: Wednesday 11th September 2013 14:34
> To: solr-user@lucene.apache.org
> Subject: Re: Dynamic analyzer settings change
> 
> Thanks, Erick!
> 
> I might have missed mentioning something relevant. When querying Solr, I
> wouldn't actually need to query all fields, but only the one corresponding
> to the language picked by the user on the website. If he's using DE, then
> the search should only apply to the text_de field.
> 
> What if I need to work with 50 different languages?
> Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
> text_de, ...). Won't this affect performance? Bigger documents ->
> slower queries.

Yes, that will affect performance greatly! The problem is not searching 50 languages as such; when using (e)dismax, the problem is building the entire query. When debugging, you will see good performance in the `process` part of a search but poor performance in the `prepare` part.


Re: Dynamic analyzer settings change

Posted by Erick Erickson <er...@gmail.com>.

You're still in danger of overly-broad hits. When you
try stemming differently into the _same_ underlying
field you get things that make sense in one language
but are totally bogus in another language matching
the query.

As far as lots and lots of fields are concerned, if you
want to restrict your searches to only one language
you have a couple of choices here.

Consider a different core per language. Solr easily
handles many cores per server. Now you have no
'wasted' space; it just happens that the analysis chain
for the German core uses the DE-specific stemmers, which
you can extend to German de-compounding etc.

Alternatively, you can form your queries with some
care. There's nothing that requires, say, edismax to
be specified in solrconfig.xml. Anything you would
put in the defaults section of the config you can
override on the request itself. So, for instance,
if you knew you were querying in French, you could
form something like (going from memory)
defType=edismax&qf=title_fr+text_fr
or
&qf=title_de+text_de

and so completely avoid cross-language searching.

Or you could simply include a field that has the
language and tack on an fq clause like fq=language:de.
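
For reference, a minimal sketch of the kind of defaults section mentioned
above. The handler name and the title_en/text_en field names are assumptions
made for illustration; they do not come from the original poster's config.

```xml
<!-- Sketch only: handler name and the *_en field names are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Fields to search, separated by whitespace -->
    <str name="qf">title_en text_en</str>
  </lst>
</requestHandler>
<!-- A request that sends its own qf parameter (for example the French
     fields) replaces the default above for that request only. -->
```

With defaults like these, a request that passes no qf falls back to the
English fields, while a per-language request names its own fields.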

But you haven't told us how big your problem is. I wouldn't
worry at all about efficiency at this stage. If you have, say,
10M documents, I'd just try the simplest thing first and
measure.

500M documents is probably another story.

FWIW
Erick

On Wed, Sep 11, 2013 at 9:50 AM, maephisto <my...@yahoo.com> wrote:

> Thanks Jack! Indeed, very nice examples in your book.
>
> Inspired by them, here's a crazy idea: would it be possible to build a
> custom processor chain that would detect the language and use it to apply
> filters, like the aforementioned SnowballPorterFilter?
> That would leave, at the end, a document with the fields text (with filtered
> content) and language (the one determined by the processor).
> And at search time, always append language=<user selected language>.
>
> Does this make sense? If so, would it affect the performance at index time?
> Thanks!

Re: Dynamic analyzer settings change

Posted by maephisto <my...@yahoo.com>.

Thanks Jack! Indeed, very nice examples in your book.

Inspired by them, here's a crazy idea: would it be possible to build a
custom processor chain that would detect the language and use it to apply
filters, like the aforementioned SnowballPorterFilter?
That would leave, at the end, a document with the fields text (with filtered
content) and language (the one determined by the processor).
And at search time, always append language=<user selected language>.

Does this make sense? If so, would it affect the performance at index time?
Thanks!




Re: Dynamic analyzer settings change

Posted by Jack Krupansky <ja...@basetechnology.com>.

Yes, supporting multiple languages will be a performance hit, but maybe it 
won't be so bad since all but one of these language-specific fields will be 
empty for each document and Lucene text search should handle empty field 
values just fine. If you can't accept that performance hit, don't support 
multiple languages! It is completely your choice.

There are index-time update processors that can do language detection and 
then automatically direct the text to the proper text_xx field.

See:
https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing

Although my e-book has a lot better examples, especially for the field 
redirection aspect.
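
A minimal sketch of such an update processor chain in solrconfig.xml follows.
It assumes the langid contrib jars are loaded, that the text to detect lives
in a field named title, and that the schema defines title_en, title_fr and
title_de; the chain name, the language list and the fallback are illustrative
choices, not settings from this thread.

```xml
<!-- Sketch only: chain name, field names and language list are assumptions. -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Field(s) whose content is used for detection -->
      <str name="langid.fl">title</str>
      <!-- Field that receives the detected language code -->
      <str name="langid.langField">language</str>
      <!-- Map title to title_<lang>, e.g. title_fr -->
      <bool name="langid.map">true</bool>
      <!-- Only map to languages the schema has fields for -->
      <str name="langid.whitelist">en,fr,de</str>
      <!-- Language to use when detection gives no usable result -->
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only takes effect if the update handler is told to use it, for
example through the update.chain request parameter.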

-- Jack Krupansky

-----Original Message----- 
From: maephisto
Sent: Wednesday, September 11, 2013 8:33 AM
To: solr-user@lucene.apache.org
Subject: Re: Dynamic analyzer settings change

Thanks, Erick!

I might have missed mentioning something relevant. When querying Solr, I
wouldn't actually need to query all fields, but only the one corresponding
to the language picked by the user on the website. If he's using DE, then
the search should only apply to the text_de field.

What if I need to work with 50 different languages?
Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
text_de, ...). Won't this affect performance? Bigger documents ->
slower queries.




Re: Dynamic analyzer settings change

Posted by maephisto <my...@yahoo.com>.

Thanks, Erick!

I might have missed mentioning something relevant. When querying Solr, I
wouldn't actually need to query all fields, but only the one corresponding
to the language picked by the user on the website. If he's using DE, then
the search should only apply to the text_de field.

What if I need to work with 50 different languages?
Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
text_de, ...). Won't this affect performance? Bigger documents ->
slower queries.




Re: Dynamic analyzer settings change

Posted by Erick Erickson <er...@gmail.com>.

I wouldn't :). Here's the problem. Say you do this successfully at
index time. How do you then search reasonably? There's often
not nearly enough information to know what the search language is;
there's little or no context.

If the number of languages is limited, people often index into separate
language-specific fields, say title_fr and title_en and use edismax
to automatically distribute queries against all the fields.

Others index "families" of languages in separate fields using things
like the folding filters for Western languages, another field for, say,
CJK languages and another for Middle Eastern languages etc.
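
A minimal sketch of the separate-fields approach, reusing the analyzer from
the original question. The type names text_en and text_fr are assumptions
chosen to match the title_en/title_fr naming above.

```xml
<!-- Sketch only: one field type and one field per language. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<field name="title_en" type="text_en" indexed="true" stored="true" />
<field name="title_fr" type="text_fr" indexed="true" stored="true" />
```

Each document then fills only the field matching its language, so the
stemmer applied at index time and at query time is the same one.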

FWIW,
Erick

On Wed, Sep 11, 2013 at 6:55 AM, maephisto <my...@yahoo.com> wrote:

> Let's take the following type definition (borrowed from Rafal
> Kuc's Solr 4 cookbook):
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> <analyzer>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> </analyzer>
> </fieldType>
>
> and schema:
>
> <field name="id" type="string" indexed="true" stored="true"
> required="true" />
> <field name="title" type="text" indexed="true" stored="true" />
>
> The above analyzer applies the SnowballPorterFilter with the English
> language setting. Would it be possible to change the language to French
> during indexing for some documents? If not, what would be the best
> solution for having the same analyzer but with different languages, with
> the language being determined at index time?
>
> Thanks!