Posted to solr-user@lucene.apache.org by Tomasz Wegrzanowski <to...@gmail.com> on 2011/11/15 05:52:02 UTC

File based wordlists for spellchecker

Hi,

I have a very large index, and I'm trying to add a spell checker for it.
I don't want to copy all the text in the index to an extra spell field,
since that would be prohibitively big and the index is already close to
as big as it can reasonably be. Instead, I just want to extract word
frequencies as I index, for offline processing.

After some filtering I get something like this (word, frequency):

a       122958495
aa      834203
aaa     175206
aaaa    22389
aaab    1522
aaai    1050
aaas    6384
aab     8109
aabb    1906
aac     35100
aacc    1692
aachen  11723
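
(For reference, one way such a list could be dumped from an existing
Lucene 3.x index, rather than during indexing - an untested sketch, the
field name "text" is made up, and note that docFreq() counts documents
containing the term, not total occurrences:)

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Dump (word, document frequency) pairs for one field of a Lucene index.
public class TermFreqDump {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    try {
      TermEnum terms = reader.terms();
      while (terms.next()) {
        Term t = terms.term();
        if ("text".equals(t.field())) {  // field name is illustrative
          System.out.println(t.text() + "\t" + terms.docFreq());
        }
      }
      terms.close();
    } finally {
      reader.close();
    }
  }
}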

I wanted to use FileBasedSpellChecker, but it doesn't support frequencies,
so its recommendations are consistently horrible. Increasing the frequency
cutoff won't really help that much - it will still suggest less frequent
words over equally similar, more frequent words.

What's the easiest way to get this working?
Presumably I'd need to create a separate index with just these words.
How do I get frequencies there, without actually creating 11723 records with
"aachen" in them etc.?

I can do some small Java coding if need be.
I'm already using the 3.x branch (mostly for edismax, plus some unrelated
minor patches).

Thanks,
Tomasz

RE: File based wordlists for spellchecker

Posted by "Dyer, James" <Ja...@ingrambook.com>.
>Doesn't IndexBasedSpellChecker simply extract (word, freq) pairs from the index,
>put them into the spellchecking index, and forget about the original index altogether?
>If so, then I'd only need to override the index building, and reuse that.
>Am I correct here, or does it actually go back to the original index?

You're correct.  It builds a stand-alone Lucene index to use as a dictionary (see IndexBasedSpellChecker.prepare(), which creates a HighFrequencyDictionary based on the field you want to run spellcheck against, then calls SpellChecker.indexDictionary, which builds the stand-alone Lucene index).  You might be able to override IBSC.prepare() to send something external to the Solr index to the Lucene SpellChecker.  But in doing this you are still going to have all the overhead of creating a stand-alone Lucene index.  And I do not know of an easy way to get it to report a term frequency > 1 without having the term actually exist in that index that many times.

If this is acceptable to you, from the looks of it FileBasedSpellChecker.loadExternalFileDictionary() will add a word to the document multiple times if it appears in the file more than once.  You could create your own file with "aachen" in it 11723 times.  Better yet, with a few minor modifications, you could have it load a custom file format that contains the doc frequency and then add the term however many times in a loop.  But this is still going to create a big dictionary, and it won't reduce the overhead whenever you call "spellcheck.build=true".
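
To make the first idea concrete, a throwaway expander could look like
this (untested sketch; the cap is my own addition to keep entries like
"a" at 122,958,495 occurrences from exploding the file, and the class
name is made up):

import java.io.*;

// Expand "word<TAB>freq" lines into a plain word list in which each word
// is repeated min(freq, CAP) times, for use with FileBasedSpellChecker.
public class ExpandWordList {
  private static final int CAP = 1000;  // arbitrary illustrative limit

  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(args[1])));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\\s+");
      if (parts.length != 2) continue;  // skip malformed lines
      long repeats = Math.min(Long.parseLong(parts[1]), CAP);
      for (long i = 0; i < repeats; i++) {
        out.println(parts[0]);
      }
    }
    in.close();
    out.close();
  }
}

Capping obviously distorts the relative frequencies of the most common
words, so treat it as a starting point rather than a recommendation.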

Overall, your best bet might be to do a <copyField> and then use DirectSolrSpellChecker so that you do not have a separate Lucene index for a dictionary.  While <copyField> will duplicate your terms, you trade that for not having the overhead of needing to build an external dictionary.  Unfortunately, this is only an option if you're willing to upgrade to Trunk/4.0.
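
On trunk, a minimal setup could look roughly like this in solrconfig.xml
(the component and field names are illustrative, and assume a "spell"
field already populated via <copyField>):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">spell</str>
  </lst>
</searchComponent>

Because DirectSolrSpellChecker reads the main index directly, there is
no "spellcheck.build" step and no on-disk dictionary to maintain.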

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311



Re: File based wordlists for spellchecker

Posted by Tomasz Wegrzanowski <to...@gmail.com>.
On 15 November 2011 15:55, Dyer, James <Ja...@ingrambook.com> wrote:
> Writing your own spellchecker to do what you propose might be difficult.  At issue is the fact that both the "index-based" and "file-based" spellcheckers are designed to work off a Lucene index and use the document frequency reported by Lucene to base their decisions.  Both spell checkers build a separate Lucene index on the fly to use as a dictionary just for this purpose.

I'm fine with a spellchecker index; it will be small compared with
everything else.

I don't want every original record to have an extra copyField, since the
copies would probably be prohibitively huge.

> But maybe you don't need to go down that path.  If your original field is not being stemmed or aggressively analyzed, then you can base your spellchecker on the original field, and there is no need to do a <copyField> for a spell check index.  If you have to do a <copyField> for the dictionary due to stemming, etc. in the original, you may be pleasantly surprised that the overhead for the copyField is a lot less than you thought.  Be sure to set it as stored=false, indexed=true, and omitNorms=true.  I'd recommend trying this before anything else as it just might work.

My original index is stemmed and very aggressively analyzed, so a
copyField would be necessary.

> If you're worried about the size of the dictionary that gets built on the fly, then I would look into possibly upgrading to Trunk/4.0 and using DirectSolrSpellChecker, which does not build a separate dictionary.  If going to Trunk is out of the question, it might be possible for you to have it store your dictionary to a different disk if disk space is your issue.
>
> If you end up writing your own spellchecker, take a look at org.apache.lucene.search.spell.SpellChecker.  You'll need to write a "suggestSimilar" method that does what you want.  Possibly you can store your terms and frequencies in a key/value hash and use that to order the results.  You then would need to write a wrapper for Solr, similar to org.apache.solr.spelling.FileBasedSpellChecker.  Like I mentioned, this would be a lot of work and it would take a lot of thought to make it perform well, etc.

Doesn't IndexBasedSpellChecker simply extract (word, freq) pairs from the index,
put them into the spellchecking index, and forget about the original index altogether?

If so, then I'd only need to override the index building, and reuse that.

Am I correct here, or does it actually go back to the original index?

RE: File based wordlists for spellchecker

Posted by "Dyer, James" <Ja...@ingrambook.com>.
Writing your own spellchecker to do what you propose might be difficult.  At issue is the fact that both the "index-based" and "file-based" spellcheckers are designed to work off a Lucene index and use the document frequency reported by Lucene to base their decisions.  Both spell checkers build a separate Lucene index on the fly to use as a dictionary just for this purpose.

But maybe you don't need to go down that path.  If your original field is not being stemmed or aggressively analyzed, then you can base your spellchecker on the original field, and there is no need to do a <copyField> for a spell check index.  If you have to do a <copyField> for the dictionary due to stemming, etc. in the original, you may be pleasantly surprised that the overhead for the copyField is a lot less than you thought.  Be sure to set it as stored=false, indexed=true, and omitNorms=true.  I'd recommend trying this before anything else as it just might work.
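
In schema.xml that could look something like this (field and type names
are illustrative):

<field name="spell" type="textSpell" indexed="true" stored="false" omitNorms="true"/>
<copyField source="text" dest="spell"/>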

If you're worried about the size of the dictionary that gets built on the fly, then I would look into possibly upgrading to Trunk/4.0 and using DirectSolrSpellChecker, which does not build a separate dictionary.  If going to Trunk is out of the question, it might be possible for you to have it store your dictionary to a different disk if disk space is your issue.
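
For the different-disk option, the index-based spellchecker lets you
point its dictionary at any path via spellcheckIndexDir; something like
the following in solrconfig.xml (names and path made up):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">/mnt/otherdisk/spellchecker</str>
  </lst>
</searchComponent>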

If you end up writing your own spellchecker, take a look at org.apache.lucene.search.spell.SpellChecker.  You'll need to write a "suggestSimilar" method that does what you want.  Possibly you can store your terms and frequencies in a key/value hash and use that to order the results.  You then would need to write a wrapper for Solr, similar to org.apache.solr.spelling.FileBasedSpellChecker.  Like I mentioned, this would be a lot of work and it would take a lot of thought to make it perform well, etc.
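
Just to illustrate the ordering idea (a standalone sketch with made-up
names, not the actual SpellChecker API; it assumes every candidate has a
similarity score in the map):

import java.util.*;

// Order candidates by similarity (descending), breaking ties with the
// externally supplied word frequency (descending).
public class FrequencyAwareRanker {
  public static List<String> rank(List<String> candidates,
                                  final Map<String, Float> similarity,
                                  final Map<String, Long> frequency) {
    List<String> result = new ArrayList<String>(candidates);
    Collections.sort(result, new Comparator<String>() {
      public int compare(String a, String b) {
        int bySim = Float.compare(similarity.get(b), similarity.get(a));
        if (bySim != 0) return bySim;
        long fa = frequency.containsKey(a) ? frequency.get(a) : 0L;
        long fb = frequency.containsKey(b) ? frequency.get(b) : 0L;
        return fb < fa ? -1 : (fb > fa ? 1 : 0);
      }
    });
    return result;
  }
}

The hard part would still be the Solr wrapper and making it perform, as
mentioned above.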

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
