Posted to solr-user@lucene.apache.org by André Schild <a....@aarboard.ch> on 2012/11/26 15:52:28 UTC

Spellchecker for multiple sites (and languages?)

Hello,

We are long-time Nutch users (since 0.7).
We have now made the big jump from 0.9 to 1.5 and Solr 4.0.


We use it to index different websites and then provide site-specific
search for these.

Currently we index the sites and store them all in one Solr instance.
The different sites are separated via the host entry in Solr; this works
fine.

An important point is that each site can have text in multiple
languages (for example en, de, fr, cn, etc.).
We separate them via the lang flag (this works fine).

We now wish to integrate the spellchecker to provide the "Did you
mean..." functionality.
This only partly works, since it always uses a word list spanning
all sites and all languages.
We would need a word list/spellchecker (based on the content
field) that is separate for each site and language.

What would be a clean way to solve this requirement?

If we create a Solr instance per site, we would at least get the
word list separated by site,
but we would still have the problem of separating them by language.


Any ideas/hints?

With best regards


-- 
Aarboard AG    Phone: +41 32 332 97 14
Egliweg 10     Fax:   +41 32 332 97 15
2560 Nidau
Switzerland    www.aarboard.ch


RE: Spellchecker for multiple sites (and languages?)

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Also see this recent mailing list thread for an explanation of how you can set up a master dictionary with everything in it but only get valid spelling suggestions returned:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3C8F0D0142CA7ECC4287A9EC1BD8CB880C182DD3F939@USLVDCMBVP01.ingramcontent.com%3E

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Monday, November 26, 2012 9:32 AM
To: solr-user@lucene.apache.org
Subject: RE: Spellchecker for multiple sites (and languages?)

Hi - check the new spellchecker collate options. They limit spellchecker suggestions to the fq restrictions. If you filter on specific hosts, the spellchecker will only provide suggestions that are found in that host. The same goes for language.

http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate



Re: Spellchecker for multiple sites (and languages?)

Posted by André Schild <a....@aarboard.ch>.
Ok,

thanks

André

On 26.11.2012 18:45, Dyer, James wrote:
> The Lucene spellcheckers just look at each word in isolation, which is what the extended results are reporting on.  So when using "maxCollationTries", etc., this information becomes less useful.  It is only when Solr tries to put these words together into a meaningful collation that you get a good query with only applicable data returned.  If you need more information than just the rewritten query, set "spellcheck.collateExtendedResults=true".  It will tell you which original word was replaced with what, etc.  This way, if you need to rewrite the query in a custom manner or want to give the users a message about which words were misspelled, you can do so easily.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: André Schild [mailto:a.schild@aarboard.ch]
> Sent: Monday, November 26, 2012 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Spellchecker for multiple sites (and languages?)
>
> On 26.11.2012 16:32, Markus Jelsma wrote:
>> Hi - check the new spellchecker collate options. They limit spellchecker suggestions to the fq restrictions. If you filter on specific hosts, the spellchecker will only provide suggestions that are found in that host. The same goes for language.
>>
>> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
> After some trial and error it does exactly what I wish, thanks.
>
> I noticed that when I use spellcheck.extendedResults=true,
> the other (invalid) results are still shown in the extended
> results.
> Is this a bug or expected behaviour in the extended results?
>
>
> André
>


-- 
Aarboard AG    Phone: +41 32 332 97 14
Egliweg 10     Fax:   +41 32 332 97 15
2560 Nidau
Switzerland    www.aarboard.ch


RE: Spellchecker for multiple sites (and languages?)

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
The Lucene spellcheckers just look at each word in isolation, which is what the extended results are reporting on.  So when using "maxCollationTries", etc., this information becomes less useful.  It is only when Solr tries to put these words together into a meaningful collation that you get a good query with only applicable data returned.  If you need more information than just the rewritten query, set "spellcheck.collateExtendedResults=true".  It will tell you which original word was replaced with what, etc.  This way, if you need to rewrite the query in a custom manner or want to give the users a message about which words were misspelled, you can do so easily.
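
(Editorial sketch, not part of the original mail.) As a rough illustration of using the collation details described above, the following Python snippet pulls the rewritten query and the word-level replacements out of a response. The sample data is hand-written and only assumes a flat name/value layout for the "suggestions" list; the real JSON layout depends on the Solr version and the json.nl setting, so treat the structure and the helper names (pairs, extract_collations) as assumptions.

```python
def pairs(flat):
    """Turn a flat [name, value, name, value, ...] list into (name, value) tuples."""
    return list(zip(flat[0::2], flat[1::2]))


def extract_collations(suggestions):
    """Collect the collation entries from a flat 'suggestions' list (assumed layout)."""
    collations = []
    for name, value in pairs(suggestions):
        if name != "collation":
            continue  # per-word suggestion: each word was checked in isolation
        if isinstance(value, str):
            # Without collateExtendedResults only the rewritten query is given.
            entry = {"collationQuery": value}
        else:
            entry = dict(pairs(value))
            # Pair every misspelled word with its correction.
            entry["misspellingsAndCorrections"] = pairs(
                entry.get("misspellingsAndCorrections", [])
            )
        collations.append(entry)
    return collations


# Hand-written sample for illustration only, NOT real Solr output.
sample = [
    "serch", {"numFound": 1, "suggestion": ["search"]},
    "collation", [
        "collationQuery", "search engine",
        "hits", 12,
        "misspellingsAndCorrections", ["serch", "search", "engin", "engine"],
    ],
]

result = extract_collations(sample)
for collation in result:
    print(collation["collationQuery"], collation.get("misspellingsAndCorrections"))
```

The word pairs can then be used to build a "Did you mean" message that names the misspelled words, while the per-word suggestions are ignored because they are not checked against the filters.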

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: André Schild [mailto:a.schild@aarboard.ch] 
Sent: Monday, November 26, 2012 11:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Spellchecker for multiple sites (and languages?)

On 26.11.2012 16:32, Markus Jelsma wrote:
> Hi - check the new spellchecker collate options. They limit spellchecker suggestions to the fq restrictions. If you filter on specific hosts, the spellchecker will only provide suggestions that are found in that host. The same goes for language.
>
> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate

After some trial and error it does exactly what I wish, thanks.

I noticed that when I use spellcheck.extendedResults=true,
the other (invalid) results are still shown in the extended
results.
Is this a bug or expected behaviour in the extended results?


André


Re: Spellchecker for multiple sites (and languages?)

Posted by André Schild <a....@aarboard.ch>.
On 26.11.2012 16:32, Markus Jelsma wrote:
> Hi - check the new spellchecker collate options. They limit spellchecker suggestions to the fq restrictions. If you filter on specific hosts, the spellchecker will only provide suggestions that are found in that host. The same goes for language.
>
> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate

After some trial and error it does exactly what I wish, thanks.

I noticed that when I use spellcheck.extendedResults=true,
the other (invalid) results are still shown in the extended
results.
Is this a bug or expected behaviour in the extended results?


André

RE: Spellchecker for multiple sites (and languages?)

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - check the new spellchecker collate options. They limit spellchecker suggestions to the fq restrictions. If you filter on specific hosts, the spellchecker will only provide suggestions that are found in that host. The same goes for language.

http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
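
(Editorial sketch, not part of the original mail.) A request along these lines might be assembled as follows in Python. The field names host and lang follow the setup described in the question, but the exact schema names, the base URL, the /select handler having the spellcheck component enabled, and the helper names are all assumptions; only the spellcheck.* and fq parameter names are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_spellcheck_params(user_query, host, lang, max_tries=5):
    """Parameters for a spellcheck request restricted to one site and one language."""
    return [
        ("q", user_query),
        # Filter queries: collations are tested against these restrictions.
        ("fq", f'host:"{host}"'),   # assumed field name for the site
        ("fq", f'lang:"{lang}"'),   # assumed field name for the language
        ("spellcheck", "true"),
        ("spellcheck.collate", "true"),
        # Number of candidate collations Solr tries against the index.
        ("spellcheck.maxCollationTries", str(max_tries)),
        ("spellcheck.maxCollations", "1"),
        ("wt", "json"),
    ]


def build_spellcheck_url(base_url, user_query, host, lang):
    """Append the encoded parameters to a (hypothetical) select handler URL."""
    params = build_spellcheck_params(user_query, host, lang)
    return base_url + "?" + urlencode(params)


# Hypothetical base URL and values, for illustration only.
url = build_spellcheck_url(
    "http://localhost:8983/solr/select", "serch engin", "www.example.com", "en"
)
print(url)
```

Because the collation is checked together with the fq filters, a suggestion that only occurs on another site or in another language should not produce a collation for this request.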
