Posted to solr-user@lucene.apache.org by anuvenk <an...@hotmail.com> on 2009/06/03 00:55:27 UTC

Is there Downside to a huge synonyms file?

In my index I have legal FAQs, forms, legal videos, etc., with a state field for
each resource.
Now if I search for "real estate san diego", I want to be able to return other
'california' results too, e.g. results from San Francisco.
I have the following fields in the index:

title                             state        description
real estate san diego example 1   california   some description
real estate carlsbad example 2    california   some desc

So when I search for "real estate san francisco", since there is no match, I
want to be able to return the other real estate results in California
instead of returning none, because sometimes users might be searching for a
real estate form and the city probably doesn't matter.

I have two things in mind. One is adding a synonym mapping:
san diego, california
carlsbad, california
san francisco, california

(which probably isn't the best way), hoping that a search for
"san francisco real estate" would map san francisco to california and hence
return the other two California results.

OR

adding the mapping of city to state in the index itself like..

title                        state        city                                 description
real estate san diego eg 1   california   carlsbad, san francisco, san diego   some description
real estate carlsbad eg 2    california   carlsbad, san francisco, san diego   some description

Which of the above two is better? Does a huge synonym file affect
performance? Or is there an even better way? I'm sure there is, but I can't
put my finger on it yet, and I'm not familiar with Java either.
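
For illustration, the second option above could look roughly like this in Solr's XML update format. This is only a sketch: the field names come from the table above, and it assumes `city` is declared as a multiValued field in schema.xml, which the thread does not show.

```xml
<!-- Hypothetical document for option 2; assumes a multiValued "city" field in schema.xml -->
<add>
  <doc>
    <field name="title">real estate san diego example 1</field>
    <field name="state">california</field>
    <!-- every city whose searches should also surface this California resource -->
    <field name="city">carlsbad</field>
    <field name="city">san francisco</field>
    <field name="city">san diego</field>
    <field name="description">some description</field>
  </doc>
</add>
```

For a city search to match such a document, the `city` field would also need to be among the searched fields, for example by listing it in the dismax `qf` parameter.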

-- 
View this message in context: http://www.nabble.com/Is-there-Downside-to-a-huge-synonyms-file--tp23842527p23842527.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is there Downside to a huge synonyms file?

Posted by anuvenk <an...@hotmail.com>.
A small addition to my earlier post. I wonder if it's because of the 'mm'
param, which requires that for search phrases of up to 3 words, all the words
must match. If I alter this now, I'd get irrelevant results for a
lot of popular 1-, 2-, and 3-word search terms. How can I solve this?
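
As a reading aid, here is a sketch of how the handler's `mm` value (`3<-1 4<-1 5<-1 6<90%`) is interpreted, as far as dismax's min-should-match syntax goes. The two-condition form below is an assumption-labeled simplification that should yield the same thresholds. It is not a fix for the zero-result problem.

```xml
<!-- Sketch: same thresholds as "3<-1 4<-1 5<-1 6<90%", written with two conditions.
     1 to 3 optional clauses: all must match
     4 to 6 optional clauses: all but one must match
     more than 6 clauses:     90% must match (rounded down) -->
<str name="mm">3&lt;-1 6&lt;90%</str>
```

Because `mm` is listed under `defaults` rather than `invariants`, it can also be overridden per request by passing an `mm` parameter, which may help when experimenting with looser matching for specific queries.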



Re: Is there Downside to a huge synonyms file?

Posted by anuvenk <an...@hotmail.com>.
I tried adding some city-to-state mappings to the synonyms file. I'm using
the dismax handler for phrase matching. As I add more and more city-to-state
mappings, I end up with zero results for state-based searches.
E.g.: ca,california,los angeles
     ca,california,san diego
     ca,california,san francisco
     ca,california,burbank    and so on...
Now a city-based search returns a few other California results, but a
state-based search like "dui california" returns zero results.
I checked parsedquery_toString and I see no 'OR', although the default
operator is 'OR' in the schema. It looks like it's trying to find matches for
all those cities, as they are mapped to 'california', and hence returns zero
results. How can I force dismax to use 'OR' and not 'AND', even though the
schema has 'OR'?
Or is this how dismax works? Can someone explain how to overcome this
problem?
Here is my custom request handler that extends dismax:
<requestHandler name="qfacet" class="solr.DisMaxRequestHandler" >
    <lst name="defaults">
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">name^2.0 text^0.8</str>
     <!-- up to 3 clauses: all must match; 4: 3 must match; 5: 4 must match; 6: 5 must match; above 6: 90% must match -->
     <str name="mm">3&lt;-1 4&lt;-1 5&lt;-1 6&lt;90%</str>
     <str name="pf">
         text^0.8 name^2.0
     </str>
     <int name="qs">4</int>
     <int name="ps">4</int>
     <str name="fl">
             *,score
     </str>  

    </lst>
    <lst name="invariants">
      <!--<str name="facet.field">resourceType</str>
      <str name="facet.field">category</str>
      <str name="facet.field">stateName</str>-->
      <str name="facet.sort">false</str>
      <int name="facet.mincount">1</int>
    </lst>
  </requestHandler>

Thanks.





Re: Is there Downside to a huge synonyms file?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

300K is a pretty small index.  I wouldn't worry about the number of synonyms unless you are turning a single term into dozens of ORed terms.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Is there Downside to a huge synonyms file?

Posted by Yonik Seeley <ys...@gmail.com>.
On Tue, Jun 2, 2009 at 11:28 PM, anuvenk <an...@hotmail.com> wrote:
> I'm using query time synonyms.

These don't currently work if the synonyms expand to more than one
option, and those options have a different number of words.

-Yonik
http://www.lucidimagination.com
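
One common way around the query-time limitation described above is to apply the synonym filter only in the index-time analyzer. The following is a minimal sketch under stated assumptions, not the configuration used in this thread: the field type name `text_geo` is hypothetical, and `synonyms.txt` is assumed to hold the city and state entries.

```xml
<!-- Hypothetical field type: synonyms are expanded at index time only -->
<fieldType name="text_geo" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- multi-word entries such as "san francisco" are expanded while indexing -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no synonym filter here, so the query keeps its original terms -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is the one already raised in the thread: with index-time expansion, any change to `synonyms.txt` only takes effect for documents that are re-indexed afterwards.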

Re: Is there Downside to a huge synonyms file?

Posted by anuvenk <an...@hotmail.com>.
I'm using query-time synonyms. I have more fields in my index, though; this is
just a sample of data from my index. Yes, we don't have millions
of documents. It could be around 300,000 and might increase in the future. The
reason I'm using query-time synonyms is the nature of my data: I
can't re-index the data every time I add or remove a synonym. But for this
particular requirement, is it best to have index-time synonyms because of the
multi-word nature of the synonyms? Again, if I add more cities to the synonym
file, I can't be re-indexing all the data over and over again.





Re: Is there Downside to a huge synonyms file?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

If index-time synonym expansion/indexing is used, then a large synonym file means your index is going to be bigger.
If query-time synonym expansion is used, then your queries are going to be larger (i.e. more ORs, thus a bit slower).

How much really depends on your specific synonyms, so I can't generalize.  I have a feeling you are not dealing with millions of documents, in which case you can most likely ignore the increase in index or query size.

 
Adding synonyms sounds like the easiest approach.  I'd try that and worry about improvement only IF I see that doesn't give adequate results.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


