Posted to solr-user@lucene.apache.org by Victor <sc...@yahoo.co.uk> on 2011/10/13 08:38:52 UTC

Multiple search analyzers on the same field type possible?

I would like to do the following in solr/lucene:

For a demo I would like to index a certain field once, but be able to query
it in 2 different ways. The first way is to query the field using a synonym
list and the second way is to query the same field without using a synonym
list. The reason I want to do this is that I want the synonym list to be
flexible and do not want to re-index everything when the list changes. Also,
I want to be able to let the user decide if he/she wants to use the synonym
list while querying.

I had hoped that a solution like this would be possible:

<fieldType name="blabla">
      <analyzer type="index">
...
      </analyzer>
      <analyzer type="query1">
...
      </analyzer>
      <analyzer type="query2">
...
      </analyzer>
    </fieldType>

And then use some kind of parameter in the url to select either query1 or
query2, but this does not seem possible in solr/lucene. 

Maybe I can use a solution using the <copyField> command, but so far I have
not been successful in getting this to work.

I still hope this is possible, thanks in advance for your help on this.


--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-search-analyzers-on-the-same-field-type-possible-tp3417898p3417898.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Multiple search analyzers on the same field type possible?

Posted by Victor van der Wolf <sc...@yahoo.co.uk>.

I don't think this will be a problem. I'll contact you tomorrow directly by
email for some details.



Re: Multiple search analyzers on the same field type possible?

Posted by Erick Erickson <er...@gmail.com>.

Excellent! Can you consider contributing this back? This
is not an unheard-of request.....

Erick

On Fri, Oct 14, 2011 at 5:45 PM, Victor <sc...@yahoo.co.uk> wrote:
> I've spent today writing my own SynonymFilter and SynonymFilterFactory. And
> it works!

Re: Multiple search analyzers on the same field type possible?

Posted by Victor <sc...@yahoo.co.uk>.

I've spent today writing my own SynonymFilter and SynonymFilterFactory. And
it works!

I've followed Erick's advice and pre- and postfixed all the words that I
want to stem with an @. So, if I want to stem the word car, I insert it in
the query as @car@.

My adapted synonym filter recognizes the pre/postfixing, removes the @
characters and continues as usual (which means the synonym filter will do
what it is supposed to be doing). If no "stemming tags" are found, it aborts
the synonym lookup part of the process for that token and returns
immediately.

So: 
car --> car
cars --> cars
@car@ --> car and cars

Mission accomplished, no extra storage needed, current index can stay as it
is, end user can switch between stemming and no stemming whenever he/she
wants to.

I think I saved a lot of money today.
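
For readers following along, the tag handling described above can be sketched
in a few lines. This is only an illustration of the logic, not the actual
filter from this post: it is plain Java rather than a Lucene TokenFilter, and
the class name MarkedSynonymExpander and the in-memory synonym map are
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the marker logic only; it does not use the Lucene TokenFilter API.
final class MarkedSynonymExpander {
    private static final char MARKER = '@';

    // Hypothetical in-memory synonym table: word -> equivalent words.
    private final Map<String, List<String>> synonyms;

    MarkedSynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    // Returns the terms to search for one query token.
    List<String> expand(String token) {
        boolean marked = token.length() > 2
                && token.charAt(0) == MARKER
                && token.charAt(token.length() - 1) == MARKER;
        if (!marked) {
            // No tags: skip the synonym lookup and pass the token through unchanged.
            return List.of(token);
        }
        // Strip the leading and trailing '@', then look the bare word up as usual.
        String bare = token.substring(1, token.length() - 1);
        List<String> terms = new ArrayList<>();
        terms.add(bare);
        for (String synonym : synonyms.getOrDefault(bare, List.of())) {
            if (!terms.contains(synonym)) {
                terms.add(synonym);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        MarkedSynonymExpander expander =
                new MarkedSynonymExpander(Map.of("car", List.of("cars")));
        System.out.println(expander.expand("car"));    // [car]
        System.out.println(expander.expand("cars"));   // [cars]
        System.out.println(expander.expand("@car@"));  // [car, cars]
    }
}
```

In a real Solr setup the same decision would have to live inside the custom
filter's token loop, with the extra terms emitted at the same position as the
original token.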


Re: Multiple search analyzers on the same field type possible?

Posted by Chantal Ackermann <ch...@btelligent.de>.

Hi Victor,

your wages are hopefully higher than the cost of disk space nowadays?
I don't want to spoil the fun of thinking up new challenges when it
comes to Solr, but from a project management point of view I would buy
some more storage and get it done with copyField and two request handlers
that choose the stemmed versus the non-stemmed fields to search on.
(Given that an index is temporary storage and does not require the
highest-quality disk RAID systems.)

Well, I'm probably being naive...

To add something valuable to this post:
Maybe you could create two cores that point to the same index. This
might be possible if you use the same index path in both solrconfig.xml
files? (I haven't tried it.) Use the exact same schema but with different
synonym files. If one synonym file is empty, and the other contains your
stemming stuff, then querying one core versus querying the other should
have the effect you expect?

No offense,
Chantal


On Fri, 2011-10-14 at 14:36 +0200, Victor wrote:
> Hi Erick,
> 
> First of all, thanks for your posts, I really appreciate this!
> 
> 1) Yes, we have tested alternative stemmers, but I admit that a definite
> decision has not been made yet. Anyway, we definitely do not want to create
> a stemmed index because of storage issues and we definitely want to be able
> to allow the end-user to turn it on and off. So choosing a different stemmer
> does not solve my problem of wanting to switch between stemming/non-stemming
> without additional indexes.

Re: Multiple search analyzers on the same field type possible?

Posted by Victor <sc...@yahoo.co.uk>.

Hi Erick,

First of all, thanks for your posts, I really appreciate this!

1) Yes, we have tested alternative stemmers, but I admit that a definite
decision has not been made yet. Anyway, we definitely do not want to create
a stemmed index because of storage issues and we definitely want to be able
to allow the end-user to turn it on and off. So choosing a different stemmer
does not solve my problem of wanting to switch between stemming/non-stemming
without additional indexes.

2) Rant granted :) And I definitely agree with you. It is always a challenge
to find a balance between what a customer wants and how far you really want
to go in achieving a solution (that does not conflict too much with the
maintainability of the system).

But, I do think that the requirements are not that outrageous. It seems
reasonable to me that once you have created an index it would be nice to be
able to use that index in different ways. After all, the only thing I want
is to apply different query analyzers (mind you, I am not formatting the
tokens, which could possibly result in index/query token conflicts; I am
merely expanding query possibilities here by adding synonyms, the rest stays
the same).

Another good example could be that you want to index a field that contains
text in different languages. Would it not be nice then to be able to define
optimized query analyzers on that field, one for each language? You could
then access them using the q parameter: q=<field>:<language>:<search term>,
where <language> is the name of the query analyzer to use. It seems to me to
be a nice feature. Could be a big change though, because I assume that at
the moment the analyzers have hard-coded names ("index" and "query").

3) Yep, I was also looking into this (because other options seemed to be
vaporizing). I don't know if I'm going to use suffixes or maybe add a trigger
word like @stem@. It depends on what the scope of the called method is. I
prefer the trigger word @stem@ variant because I can then just insert that
without needing to parse the query string to find out what the actual search
words are that I need to suffix.

Cheers and again, thanks for helping me on this,
Victor


Re: Multiple search analyzers on the same field type possible?

Posted by Erick Erickson <er...@gmail.com>.

Hmmmm....

A couple of things.
1> Have you looked at alternate stemmers? Porter stemmer is rather
     aggressive. Perhaps a less aggressive stemmer would suit your
     internal users.
2> Try a few things, but if you can't solve it reasonably quickly,
      go back to your internal customer and explain the costs of
      fixing this. Really. You're jumping through hoops because
     results "did not please my internal customer". Can they
      quantify their objections? Or is this just looking at the
      results for random searches and guessing at relevance?
      If the latter, you really, really, really need to get them to
      quantify their objections and I bet you'll find that they can't.
      And you'll forever be trying to tweak results to please
      how they feel about it today. Which will be different from
      how they felt about *the exact same results* yesterday.
      You can go around this loop forever.

      We've (programmers in general) done a rather poor job
      historically of laying out the *costs* of fixing things to
      suit a customer and allowing the various stake-holders
      to make rational decisions. We say "Sure, that can be done"
      and leave out "but it will take a month when we won't
      be able to do X, Y, or Z, and requires more hardware".
      There, rant done....

3> I suppose you could think about writing your own filter that
     added the original token and the stemmed token.
     Something like the SynonymFilter but instead of alternate
     versions of the word, you'd have the stemmed version
     and the original version at the same position. Or maybe
     you have the stemmed version and then the original
     version with a special ending character (say $) appended.
     Then you'd have to somehow write a query-time
     analysis chain (or a query parser?) that somehow
     knew enough to use the stemmed or original word (plus $)
     in the query. But I admit I haven't thought this through
     at all. There'd have to be some parameter you passed
     through with the query that controlled whether the
     regular stemming process happened or not... And I
     don't know offhand how that'd work.

     Or reverse that. Append $ to all the stemmed variants.
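
A rough sketch of idea 3>, for illustration only. It is plain Java rather
than a Lucene TokenFilter; the class DualTokenSketch, the Token holder and
the toy stemmer (which just strips a trailing "s") are assumptions invented
for the example, not anything from Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch: index the stem and the original (marked with '$') at the same position,
// then choose one of the two forms at query time.
final class DualTokenSketch {
    static final char ORIGINAL_SUFFIX = '$';

    // Minimal stand-in for a token with a position.
    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + "/" + position;
        }
    }

    // Index side: emit two tokens per word, both at the word's position.
    static List<Token> indexTokens(List<String> words, Function<String, String> stemmer) {
        List<Token> out = new ArrayList<>();
        for (int pos = 0; pos < words.size(); pos++) {
            String word = words.get(pos);
            out.add(new Token(stemmer.apply(word), pos));
            out.add(new Token(word + ORIGINAL_SUFFIX, pos));
        }
        return out;
    }

    // Query side: a flag passed with the query decides which form is searched.
    static String queryTerm(String word, boolean stemmingOn, Function<String, String> stemmer) {
        return stemmingOn ? stemmer.apply(word) : word + ORIGINAL_SUFFIX;
    }

    // Toy stemmer, an assumption for the demo only.
    static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens(List.of("cars", "park"), DualTokenSketch::toyStem));
        System.out.println(queryTerm("cars", true, DualTokenSketch::toyStem));
        System.out.println(queryTerm("cars", false, DualTokenSketch::toyStem));
    }
}
```

Note that this variant indexes two tokens per word, so it trades index size
for the ability to switch stemming on and off per query.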

But really, before going there (which I admit would be more
fun than arguing with your customer), try one of the less
aggressive stemmers. Or see if your other stake-holders
would be better served by not stemming at all. Or....

Best
Erick


On Fri, Oct 14, 2011 at 3:22 AM, Victor <sc...@yahoo.co.uk> wrote:
> Hi Erick,
>
> I work for a very big library and we store huge amounts of data. Indexing
> some of our collections can take days and the index files can get very big.
> We are a non-profit organisation, so we want to provide maximum service to
> our customers but at the same time we are bound to a fixed budget and want
> to keep costs as low as possible (including disk space). Our customers vary
> from academic people that want to do very precise searches to common users
> who want to search in a more general way. The library now wants to implement
> some form of stemming, but we have had one demo in the past with a stemmer
> that returned results that did not please my internal customer (another
> department).

Re: Multiple search analyzers on the same field type possible?

Posted by Victor <sc...@yahoo.co.uk>.

Hi Erick,

I work for a very big library and we store huge amounts of data. Indexing
some of our collections can take days and the index files can get very big.
We are a non-profit organisation, so we want to provide maximum service to
our customers but at the same time we are bound to a fixed budget and want
to keep costs as low as possible (including disk space). Our customers vary
from academic people that want to do very precise searches to common users
who want to search in a more general way. The library now wants to implement
some form of stemming, but we have had one demo in the past with a stemmer
that returned results that did not please my internal customer (another
department).

So my wish list looks like this:

1) Implement stemming
2) Give the end user the possibility to turn stemming on or off for their
searches
3) Have maximum control over the stemmer without the need to reindex if we
change something there
4) Prevent the need for more storage (to keep the operations people happy)

So far I have been able to satisfy 1, 2 and 3. I am using a synonym list at
query time to apply my stemming. I build the synonym list as follows:

a) load a library (a text file with 1 word per line)
b) remove stop words from the list
c) link words that have the same stem

Bullet c) is a little bit more sophisticated, because I do not link words
that are already part of a pre-defined synonym list that contains
exceptions.
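
Steps a) to c) could look roughly like the sketch below. It is an
illustration under assumptions, not the actual tool: the class name
StemSynonymListBuilder and the toy stemmer (strip a trailing "s") are made
up for the example, and words on the exception list are simply skipped. Each
output line uses the comma-separated form that Solr synonym files use for
equivalent terms.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

final class StemSynonymListBuilder {

    // Groups words by stem and returns one "a, b, c" line per group of two or more words.
    static List<String> build(Collection<String> words, Set<String> stopWords,
                              Set<String> exceptions, Function<String, String> stemmer) {
        Map<String, Set<String>> byStem = new TreeMap<>();
        for (String raw : words) {
            String word = raw.trim().toLowerCase(Locale.ROOT);
            // b) drop stop words; also skip words handled by the exception list.
            if (word.isEmpty() || stopWords.contains(word) || exceptions.contains(word)) {
                continue;
            }
            // c) link words that share a stem.
            byStem.computeIfAbsent(stemmer.apply(word), stem -> new TreeSet<>()).add(word);
        }
        List<String> lines = new ArrayList<>();
        for (Set<String> group : byStem.values()) {
            if (group.size() > 1) {
                lines.add(String.join(", ", group));
            }
        }
        return lines;
    }

    // Toy stemmer, an assumption for the demo; a real stemmer would go here.
    static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    public static void main(String[] args) {
        // a) in practice the words would be read from a text file, one per line.
        List<String> words = List.of("car", "cars", "the", "news", "new", "house", "houses");
        List<String> lines = build(words, Set.of("the"), Set.of("news"),
                StemSynonymListBuilder::toyStem);
        lines.forEach(System.out::println);
        // car, cars
        // house, houses
    }
}
```

Here "news" is on the exception list, so it is not linked to "new" even
though the toy stemmer would map both to the same stem.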

All this I do to keep maximum control over the behaviour of the stemmer.
Since this is a demo and it will be used to convince other people in my
organisation that stemming could be worth implementing, I need to be able to
adjust its behaviour quickly.

So far processing speed has not been an issue, but disk storage has.
Generally, at index time we remove as few tokens as possible and our objects
are complete books, newspapers (from 1618 until 1995), etc. So you can
imagine that our indexes get very, very big.


Re: Multiple search analyzers on the same field type possible?

Posted by Erick Erickson <er...@gmail.com>.

Yes, it will make your existing index larger, but it's, as far as I know,
the only way to effect this. You've outlined the process well.

But this may not be that valuable in the general case. Very frequently,
if people use different analyzers at index and query time, they get into
deep trouble. And in this specific instance, I'm assuming that you're
NOT expanding synonyms at index time, which has its own
problems (see multi-word synonym expansion).

So, one way to handle this is to do only index-time expansion, but
boost the term very high at query time. That way, the exact term
will tend to be more valuable for score calculations.

It's not a straightforward problem, and perhaps this is an XY problem. So
if you could back up a bit and explain what problem you're trying
to solve from a higher level, perhaps other methods would be more
appropriate...

Best
Erick

On Thu, Oct 13, 2011 at 11:11 AM, Victor <sc...@yahoo.co.uk> wrote:
> Sorry Erick, my last post and yours crossed each other.
>
> I am reluctant to use another index (or a multi-value index) since I think
> it will increase the storage I need for those indexes without adding
> functionality (and storage could be an issue for me).

Re: Multiple search analyzers on the same field type possible?

Posted by Victor <sc...@yahoo.co.uk>.

Sorry Erick, my last post and yours crossed each other.

I am reluctant to use another index (or a multi-value index) since I think
it will increase the storage I need for those indexes without adding
functionality (and storage could be an issue for me).

But first let's see if I understand you correctly:

<fieldType name="fieldA">
      <analyzer type="index">
...
      </analyzer>
      <analyzer type="query">
... no synonyms
      </analyzer>
   </fieldType>    
 </types>

<fieldType name="fieldB">
      <analyzer type="index">
... (the same as "fieldA")
      </analyzer>
      <analyzer type="query">
... (the same as "fieldA" + <filter class="solr.SynonymFilterFactory"/>)
      </analyzer>
   </fieldType>    
 </types>

<field name="desc_no_synonyms" type="fieldA" indexed="true" stored="true" />
<field name="desc_yes_synonyms" type="fieldB" indexed="true" stored="false" />

<copyField source="desc_no_synonyms" dest="desc_yes_synonyms"/>

User wants to query without synonyms:
1) q=desc_no_synonyms:"hot" fl=desc_no_synonyms

User wants to query with synonyms:
2) q=desc_yes_synonyms:"hot" fl=desc_no_synonyms

In case 1) the user gets the description with only "hot" in it,
in case 2) the user gets the description with "hot" or "warm" in it.
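
On the client side, the switch between the two cases could be a small helper
like the sketch below. The field names are the ones from this post; the class
SynonymToggleQuery and its methods are hypothetical, and the sketch only
builds the q and fl parameters rather than calling Solr.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: choose the field to query based on a user-controlled synonyms flag.
final class SynonymToggleQuery {

    // Builds the Solr query text, e.g. desc_yes_synonyms:"hot".
    static String fieldQuery(String term, boolean useSynonyms) {
        String field = useSynonyms ? "desc_yes_synonyms" : "desc_no_synonyms";
        // Escape backslashes and double quotes so the term stays inside the phrase.
        String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    // Builds the URL parameters; the stored field is always the one returned.
    static String queryString(String term, boolean useSynonyms) {
        return "q=" + URLEncoder.encode(fieldQuery(term, useSynonyms), StandardCharsets.UTF_8)
                + "&fl=desc_no_synonyms";
    }

    public static void main(String[] args) {
        System.out.println(queryString("hot", false));
        System.out.println(queryString("hot", true));
    }
}
```

Both variants return desc_no_synonyms in fl, because that is the only field
of the two that is stored.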

I understand that "fieldB" does not store the contents, but it will create
an extra index or expand an already existing one, right?


Re: Multiple search analyzers on the same field type possible?

Posted by Victor <sc...@yahoo.co.uk>.

Or, alternatively, it would be nice to link a field to another field so that
it can use the index of that field.

Being able to use different "query analyzers" on the same index would make
Solr/Lucene as a whole more flexible, I think. But let's wait and see, maybe
it is possible to do this and I am just missing it.


Re: Multiple search analyzers on the same field type possible?

Posted by Erick Erickson <er...@gmail.com>.

Why are you reluctant to set ' indexed="true" ' for that field? It
doesn't create another index, it's just another field in your current
index. You can surely set ' stored="false" ' in order not to keep
a copy of the raw data though...

Indexing fields twice is a common option when treating those
fields differently..

Best
Erick

On Thu, Oct 13, 2011 at 10:12 AM, Victor <sc...@yahoo.co.uk> wrote:
> I looked at the copyField solution and found it not suitable for what I am
> trying to do. I defined a new <field> using a <fieldType> that uses a synonym
> filter for the query analyzer. Then I used a <copyField> command to fill it
> with the data that I want. Since I do not want to create another index I set
> the indexed parameter to false, indexed="false". I found that it is impossible
> to query on this field (which is logical, since Solr does its querying
> based on indexes).

Re: Multiple search analyzers on the same field type possible?

Posted by Victor <sc...@yahoo.co.uk>.

I looked at the copyField solution and found it not suitable for what I am
trying to do. I defined a new <field> using a <fieldType> that uses a synonym
filter for the query analyzer. Then I used a <copyField> command to fill it
with the data that I want. Since I do not want to create another index I set
the indexed parameter to false, indexed="false". I found that it is impossible
to query on this field (which is logical, since Solr does its querying
based on indexes).

I guess what I would need is an expansion of the solr functionality of the q
parameter, like:

q=<field>:<analyzer>:<search term>

Should I wait for it? :-)


Re: Multiple search analyzers on the same field type possible?

Posted by Erick Erickson <er...@gmail.com>.

There's nothing in Solr that'll do this for you that I
know of. The copyField solution is probably your
best option.

The idea is that you have two field definitions
that use two different field types, one for each
flavor of query analyzer. Then, you can use
<copyField> to copy the field in question into
the second field. Now you use different
forms of the query, something like
q=field1:stuff
or
q=field2:stuff
when you want to make the distinction.

Best
Erick

On Thu, Oct 13, 2011 at 2:38 AM, Victor <sc...@yahoo.co.uk> wrote:
> I would like to do the following in solr/lucene:
>
> For a demo I would like to index a certain field once, but be able to query
> it in 2 different ways. The first way is to query the field using a synonym
> list and the second way is to query the same field without using a synonym
> list. The reason I want to do this is that I want the synonym list to be
> flexible and do not want to re-index everything when the list changes. Also,
> I want to be able to let the user decide if he/she wants to use the synonym
> list while querying.