Posted to solr-user@lucene.apache.org by Mark Fenbers <ma...@noaa.gov> on 2015/10/01 13:48:13 UTC

Solr vs Lucene

Greetings!

Being a newbie, I'm still mostly in the dark regarding where the line is 
between Solr and Lucene.  The following code snippet is -- I think -- 
all Lucene and no Solr.  It is a significantly modified version of some 
example code I found on the net.

dir = 
FSDirectory.open(FileSystems.getDefault().getPath("/localapps/dev/EventLog/solr/data", 
"SpellIndex"));
speller = new SpellChecker(dir);
fis = new FileInputStream("/usr/share/dict/words");
analyzer = new StandardAnalyzer();
speller.indexDictionary(new PlainTextDictionary(fis), new 
IndexWriterConfig(analyzer), false);

// now let's see speller in action...
System.out.println(speller.exist("beez"));  // returns false
System.out.println(speller.exist("bees"));  // returns true

String[] suggestions = speller.suggestSimilar("beez", 10);
for (String suggestion : suggestions)
     System.err.println(suggestion);

(Later in my code, I close what objects need to be...)  This code 
(above) does the following:

 1. Identifies whether a given word is misspelled or spelled correctly.
 2. Gives alternate suggestions to a given word (whether spelled
    correctly or not).
 3. I presume, but haven't tested this yet, that I can add a second or
    third word list to the index, say, a site dictionary containing
    names of people or places commonly found in the text.

But this code does not:

 1. parse any given text into words, testing each word.
 2. provide markers showing where the misspelled/suspect words are
    within the text.

and so my code will have to provide the latter functionality.  Or does 
Solr provide this capability, such that it would be silly to write my own?
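
To sketch what that missing piece might look like: this uses java.text.BreakIterator and a plain word Set as a stand-in for the Lucene SpellChecker, so it only illustrates the tokenize-and-mark step, not the actual speller:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MisspellingLocator {

    // Returns [start, end) character offsets of words not found in the dictionary.
    static List<int[]> findSuspectWords(String text, Set<String> dictionary) {
        List<int[]> spans = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance();
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            // BreakIterator also yields whitespace/punctuation segments; skip those.
            if (!token.chars().allMatch(Character::isLetter)) continue;
            // Here you would call speller.exist(token) instead of the Set lookup.
            if (!dictionary.contains(token.toLowerCase())) {
                spans.add(new int[] { start, end });
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("the", "bees", "are", "busy");
        String text = "The beez are busy.";
        for (int[] span : findSuspectWords(text, dict)) {
            System.out.println(text.substring(span[0], span[1])
                    + " @ " + span[0] + "-" + span[1]);  // prints "beez @ 4-8"
        }
    }
}
```

The offsets are exactly what a StyledText-style highlighter needs to mark suspect words in place.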

Thanks,

Mark


Re: Solr vs Lucene

Posted by Jack Krupansky <ja...@gmail.com>.
Did you have a specific reason why you didn't want to send an HTTP request
to Solr to perform the spellcheck operation? I mean, that is probably
easier than diving into raw Lucene code. Also, Solr lets you do a
spellcheck from a remote client whereas the Lucene spellcheck needs to be
on the same machine as the Lucene/Solr index directory.
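
For illustration, such a request is just an HTTP GET against a spellcheck-enabled handler; the sketch below only builds the URL (the core name "eventlog" and the "/spell" handler path are assumptions that depend on how the handler is configured):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SpellcheckUrl {

    // Builds the query URL for Solr's SpellCheckComponent.
    static String buildUrl(String coreBaseUrl, String text) {
        String q = URLEncoder.encode(text, StandardCharsets.UTF_8);
        return coreBaseUrl + "/spell?q=" + q
                + "&spellcheck=true"       // enable the component
                + "&spellcheck.collate=true"
                + "&wt=json";
    }

    public static void main(String[] args) {
        System.out.println(
                buildUrl("http://localhost:8983/solr/eventlog", "beez"));
    }
}
```

Any HTTP client (or SolrJ) can then issue the request from a remote machine, which is the advantage Jack describes.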

-- Jack Krupansky

On Fri, Oct 2, 2015 at 7:42 AM, Mark Fenbers <ma...@noaa.gov> wrote:

> Thanks for the suggestion, but I've looked at aspell and hunspell, and
> neither provides a native Java API.  Further, I already use Solr as a
> search engine, so why not stick with this infrastructure for spelling,
> too?  I think it will work well for me once I figure out the right
> configuration to get it to do what I want.
>
> Mark
>
>
> On 10/1/2015 4:16 PM, Walter Underwood wrote:
>
>> If you want a spell checker, don’t use a search engine. Use a spell
>> checker. Something like aspell (http://aspell.net/ <http://aspell.net/>)
>> will be faster and better than Solr.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>
>

Re: Solr vs Lucene

Posted by Mark Fenbers <ma...@noaa.gov>.
Thanks for the suggestion, but I've looked at aspell and hunspell, and 
neither provides a native Java API.  Further, I already use Solr as a 
search engine, so why not stick with this infrastructure for spelling, 
too?  I think it will work well for me once I figure out the right 
configuration to get it to do what I want.

Mark

On 10/1/2015 4:16 PM, Walter Underwood wrote:
> If you want a spell checker, don’t use a search engine. Use a spell checker. Something like aspell (http://aspell.net/ <http://aspell.net/>) will be faster and better than Solr.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Solr vs Lucene

Posted by Walter Underwood <wu...@wunderwood.org>.
If you want a spell checker, don’t use a search engine. Use a spell checker. Something like aspell (http://aspell.net/ <http://aspell.net/>) will be faster and better than Solr.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 1, 2015, at 1:06 PM, Mark Fenbers <ma...@noaa.gov> wrote:
> 
> This is with Solr.  The Lucene approach (assuming that is what is in my Java code, shared previously) works flawlessly, albeit with fewer options, AFAIK.
> 
> I'm not sure what you mean by "business case"...  I want to spell-check user-supplied text in my Java app.  The end-user then activates the spell-checker on the entire text (presumably, a few paragraphs or less).  I can use StyledText's capabilities to highlight the misspelled words, and when the user clicks the highlighted word, a menu will appear where he can select a suggested spelling.
> 
> But so far, I've had trouble:
> 
> * determining which words are misspelled (because Solr often returns
>   suggestions for correctly spelled words).
> * getting coherent suggestions (regardless if the query word is
>   misspelled or not).
> 
> It's been a bit puzzling (and frustrating)!!  It only took me 10 minutes to get the Lucene spell checker working, but I agree that Solr would be the better way to go, if I can ever get it configured properly...
> 
> Mark
> 
> 
> On 10/1/2015 12:50 PM, Alexandre Rafalovitch wrote:
>> Is that with Lucene or with Solr? Because Solr has several different
>> spell-checker modules you can configure.  I would recommend trying
>> them first.
>> 
>> And, frankly, I still don't know what your business case is.
>> 
>> Regards,
>>    Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>> 
>> 
>> On 1 October 2015 at 12:38, Mark Fenbers <ma...@noaa.gov> wrote:
>>> Yes, and I've spent numerous hours configuring and reconfiguring, and
>>> eventually even starting over, but still haven't gotten it to work right.
>>> Even now, I'm getting bizarre results.  For example, I query "NOTE: This
>>> is purely as an example." and I get back really bizarre suggestions, like
>>> "n ot e" and "n o te" and "n o t e" for the first word, which isn't even
>>> misspelled!  The same goes for "purely" and "example", too!  Moreover, I get
>>> extended results showing the frequencies of these suggestions being over
>>> 2600 occurrences, when I'm not even using an indexed spell checker.  I'm
>>> only using a file-based spell checker (/usr/share/dict/words) and the
>>> wordbreak checker.
>>> 
>>> At this point, I can't even figure out how to narrow down my confusion so
>>> that I can post concise questions to the group.  But I'll get there
>>> eventually, starting with removing the wordbreak checker for the time being.
>>> Your response was encouraging, at least.
>>> 
>>> Mark
>>> 
>>> 
>>> 
>>> On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
>>>> Hi Mark,
>>>> 
>>>> Have you gone through a Solr tutorial yet? If/when you do, you will
>>>> see you don't need to code any of this. It is configured as part of
>>>> the web-facing total offering, which is tweaked by XML configuration
>>>> files (or REST API calls). And most of the standard pipelines are
>>>> already pre-configured, so you don't need to invent them from scratch.
>>>> 
>>>> On your specific question, it would be better to ask what _business_
>>>> level functionality you are trying to achieve and see if Solr can help
>>>> with that. Starting from Lucene code is less useful :-)
>>>> 
>>>> Regards,
>>>>     Alex.
>>>> ----
>>>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>>>> http://www.solr-start.com/
>>>> 
>>>> 
>>>> On 1 October 2015 at 07:48, Mark Fenbers <ma...@noaa.gov> wrote:
> 


Re: Solr vs Lucene

Posted by Mark Fenbers <ma...@noaa.gov>.
This is with Solr.  The Lucene approach (assuming that is what is in my 
Java code, shared previously) works flawlessly, albeit with fewer 
options, AFAIK.

I'm not sure what you mean by "business case"...  I want to 
spell-check user-supplied text in my Java app.  The end-user then 
activates the spell-checker on the entire text (presumably, a few 
paragraphs or less).  I can use StyledText's capabilities to highlight 
the misspelled words, and when the user clicks the highlighted word, a 
menu will appear where he can select a suggested spelling.

But so far, I've had trouble:

  * determining which words are misspelled (because Solr often returns
    suggestions for correctly spelled words).
  * getting coherent suggestions (regardless if the query word is
    misspelled or not).

It's been a bit puzzling (and frustrating)!!  It only took me 10 minutes 
to get the Lucene spell checker working, but I agree that Solr would be 
the better way to go, if I can ever get it configured properly...

Mark


On 10/1/2015 12:50 PM, Alexandre Rafalovitch wrote:
> Is that with Lucene or with Solr? Because Solr has several different
> spell-checker modules you can configure.  I would recommend trying
> them first.
>
> And, frankly, I still don't know what your business case is.
>
> Regards,
>     Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 1 October 2015 at 12:38, Mark Fenbers <ma...@noaa.gov> wrote:
>> Yes, and I've spent numerous hours configuring and reconfiguring, and
>> eventually even starting over, but still haven't gotten it to work right.
>> Even now, I'm getting bizarre results.  For example, I query "NOTE: This
>> is purely as an example." and I get back really bizarre suggestions, like
>> "n ot e" and "n o te" and "n o t e" for the first word, which isn't even
>> misspelled!  The same goes for "purely" and "example", too!  Moreover, I get
>> extended results showing the frequencies of these suggestions being over
>> 2600 occurrences, when I'm not even using an indexed spell checker.  I'm
>> only using a file-based spell checker (/usr/share/dict/words) and the
>> wordbreak checker.
>>
>> At this point, I can't even figure out how to narrow down my confusion so
>> that I can post concise questions to the group.  But I'll get there
>> eventually, starting with removing the wordbreak checker for the time being.
>> Your response was encouraging, at least.
>>
>> Mark
>>
>>
>>
>> On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
>>> Hi Mark,
>>>
>>> Have you gone through a Solr tutorial yet? If/when you do, you will
>>> see you don't need to code any of this. It is configured as part of
>>> the web-facing total offering, which is tweaked by XML configuration
>>> files (or REST API calls). And most of the standard pipelines are
>>> already pre-configured, so you don't need to invent them from scratch.
>>>
>>> On your specific question, it would be better to ask what _business_
>>> level functionality you are trying to achieve and see if Solr can help
>>> with that. Starting from Lucene code is less useful :-)
>>>
>>> Regards,
>>>      Alex.
>>> ----
>>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>>> http://www.solr-start.com/
>>>
>>>
>>> On 1 October 2015 at 07:48, Mark Fenbers <ma...@noaa.gov> wrote:


Re: Solr vs Lucene

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Is that with Lucene or with Solr? Because Solr has several different
spell-checker modules you can configure.  I would recommend trying
them first.

And, frankly, I still don't know what your business case is.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 12:38, Mark Fenbers <ma...@noaa.gov> wrote:
> Yes, and I've spent numerous hours configuring and reconfiguring, and
> eventually even starting over, but still haven't gotten it to work right.
> Even now, I'm getting bizarre results.  For example, I query "NOTE: This
> is purely as an example." and I get back really bizarre suggestions, like
> "n ot e" and "n o te" and "n o t e" for the first word, which isn't even
> misspelled!  The same goes for "purely" and "example", too!  Moreover, I get
> extended results showing the frequencies of these suggestions being over
> 2600 occurrences, when I'm not even using an indexed spell checker.  I'm
> only using a file-based spell checker (/usr/share/dict/words) and the
> wordbreak checker.
>
> At this point, I can't even figure out how to narrow down my confusion so
> that I can post concise questions to the group.  But I'll get there
> eventually, starting with removing the wordbreak checker for the time being.
> Your response was encouraging, at least.
>
> Mark
>
>
>
> On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
>>
>> Hi Mark,
>>
>> Have you gone through a Solr tutorial yet? If/when you do, you will
>> see you don't need to code any of this. It is configured as part of
>> the web-facing total offering, which is tweaked by XML configuration
>> files (or REST API calls). And most of the standard pipelines are
>> already pre-configured, so you don't need to invent them from scratch.
>>
>> On your specific question, it would be better to ask what _business_
>> level functionality you are trying to achieve and see if Solr can help
>> with that. Starting from Lucene code is less useful :-)
>>
>> Regards,
>>     Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 1 October 2015 at 07:48, Mark Fenbers <ma...@noaa.gov> wrote:

Re: Solr vs Lucene

Posted by Mark Fenbers <ma...@noaa.gov>.
Yes, and I've spent numerous hours configuring and reconfiguring, and 
eventually even starting over, but still haven't gotten it to work 
right.  Even now, I'm getting bizarre results.  For example, I query 
"NOTE: This is purely as an example." and I get back really bizarre 
suggestions, like "n ot e" and "n o te" and "n o t e" for the first 
word, which isn't even misspelled!  The same goes for "purely" and 
"example", too!  Moreover, I get extended results showing the 
frequencies of these suggestions being over 2600 occurrences, when I'm 
not even using an indexed spell checker.  I'm only using a file-based 
spell checker (/usr/share/dict/words) and the wordbreak checker.

At this point, I can't even figure out how to narrow down my confusion 
so that I can post concise questions to the group.  But I'll get there 
eventually, starting with removing the wordbreak checker for the time 
being.  Your response was encouraging, at least.
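
For reference, a file-based checker plus the wordbreak checker would be wired up in solrconfig.xml roughly like this (a sketch, not a tested config: the component name, handler path, and field name "logtext" are assumptions).  Since the wordbreak checker is what produces split-word suggestions like "n ot e", dropping it from the handler's dictionaries is the quickest way to suppress those:

```
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">/usr/share/dict/words</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">logtext</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">file</str>
    <str name="spellcheck.dictionary">wordbreak</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```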

Mark


On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
> Hi Mark,
>
> Have you gone through a Solr tutorial yet? If/when you do, you will
> see you don't need to code any of this. It is configured as part of
> the web-facing total offering, which is tweaked by XML configuration
> files (or REST API calls). And most of the standard pipelines are
> already pre-configured, so you don't need to invent them from scratch.
>
> On your specific question, it would be better to ask what _business_
> level functionality you are trying to achieve and see if Solr can help
> with that. Starting from Lucene code is less useful :-)
>
> Regards,
>     Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 1 October 2015 at 07:48, Mark Fenbers <ma...@noaa.gov> wrote:

Re: Solr vs Lucene

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Hi Mark,

Have you gone through a Solr tutorial yet? If/when you do, you will
see you don't need to code any of this. It is configured as part of
the web-facing total offering, which is tweaked by XML configuration
files (or REST API calls). And most of the standard pipelines are
already pre-configured, so you don't need to invent them from scratch.

On your specific question, it would be better to ask what _business_
level functionality you are trying to achieve and see if Solr can help
with that. Starting from Lucene code is less useful :-)

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 1 October 2015 at 07:48, Mark Fenbers <ma...@noaa.gov> wrote:
> Greetings!
>
> Being a newbie, I'm still mostly in the dark regarding where the line is
> between Solr and Lucene.  The following code snippet is -- I think -- all
> Lucene and no Solr.  It is a significantly modified version of some example
> code I found on the net.
>
> dir =
> FSDirectory.open(FileSystems.getDefault().getPath("/localapps/dev/EventLog/solr/data",
> "SpellIndex"));
> speller = new SpellChecker(dir);
> fis = new FileInputStream("/usr/share/dict/words");
> analyzer = new StandardAnalyzer();
> speller.indexDictionary(new PlainTextDictionary(fis), new
> IndexWriterConfig(analyzer), false);
>
> // now let's see speller in action...
> System.out.println(speller.exist("beez"));  // returns false
> System.out.println(speller.exist("bees"));  // returns true
>
> String[] suggestions = speller.suggestSimilar("beez", 10);
> for (String suggestion : suggestions)
>     System.err.println(suggestion);
>
> (Later in my code, I close what objects need to be...)  This code (above)
> does the following:
>
> 1. Identifies whether a given word is misspelled or spelled correctly.
> 2. Gives alternate suggestions to a given word (whether spelled
>    correctly or not).
> 3. I presume, but haven't tested this yet, that I can add a second or
>    third word list to the index, say, a site dictionary containing
>    names of people or places commonly found in the text.
>
> But this code does not:
>
> 1. parse any given text into words, testing each word.
> 2. provide markers showing where the misspelled/suspect words are
>    within the text.
>
> and so my code will have to provide the latter functionality.  Or does Solr
> provide this capability, such that it would be silly to write my own?
>
> Thanks,
>
> Mark
>