Posted to java-user@lucene.apache.org by ta...@controldocs.com on 2005/06/21 20:59:21 UTC

Anomaly in defining search phrase

I found a discrepancy in results for an identical search ("processing")
done with Lucene and MySQL. It seems that Lucene is not returning results
where the search word is adjacent to a "-" (hyphen) or "." (period). For
example, it didn't return results for text that contained
"processing-7-bit" and "straighforwerd.processing", but MySQL did. Is this
a settings issue, or is it something unavoidable?

Thanks
Tareque
ControlDOCS

PS: In contrast, I previously found Lucene returning some results
that MySQL didn't, for example when the search word was adjacent to "'"
(apostrophe) or "_" (underscore). I am not complaining about this; rather,
I found it preferable for my purpose.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Anomaly in defining search phrase

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 22, 2005, at 11:35 AM, tareque@controldocs.com wrote:
> Thanks! Using StopAnalyzer helped solve the problem. Is there any
> detailed documentation of what each of these analyzers does?

Here are some pointers:

     - Lucene's javadocs give a brief description, such as
       <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/StopAnalyzer.html>

     - The source code is the ultimate documentation:
       <http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java?rev=168970&view=markup>
       - look at the tokenStream method

     - Several Lucene articles: <http://wiki.apache.org/jakarta-lucene/Resources>
       with the most relevant being my java.net article here:
       <http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html>
       where the AnalysisDemo code is provided.

     - And last but certainly not least, "Lucene in Action" :)  You can
       search for details of analyzers at the lucenebook.com site, like
       this: <http://www.lucenebook.com/search?query=StopAnalyzer> The
       Analysis chapter in LIA provides in-depth details of each of the
       built-in analyzers.

Erik




Re: Anomaly in defining search phrase

Posted by ta...@controldocs.com.


Thanks! Using StopAnalyzer helped solve the problem. Is there any detailed
documentation of what each of these analyzers does?

Tareque




Re: Anomaly in defining search phrase

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 21, 2005, at 2:59 PM, tareque@controldocs.com wrote:

> I found a discrepancy in results for an identical search ("processing")
> done with Lucene and MySQL. It seems that Lucene is not returning results
> where the search word is adjacent to a "-" (hyphen) or "." (period). For
> example, it didn't return results for text that contained
> "processing-7-bit" and "straighforwerd.processing", but MySQL did. Is this
> a settings issue, or is it something unavoidable?
>
> Thanks
> Tareque
> ControlDOCS
>
> PS: In contrast, I previously found Lucene returning some results
> that MySQL didn't, for example when the search word was adjacent to "'"
> (apostrophe) or "_" (underscore). I am not complaining about this; rather,
> I found it preferable for my purpose.

These all boil down to your choice of analyzer.  What analyzer are  
you using?

As you can see below, "processing-7-bit" is tokenized quite  
differently depending on the analyzer:

$ ant AnalyzerDemo
Buildfile: build.xml

     [input] String to analyze: [This string will be analyzed.]
processing-7-bit
      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "processing-7-bit"
      [java]   WhitespaceAnalyzer:
      [java]     [processing-7-bit]

      [java]   SimpleAnalyzer:
      [java]     [processing] [bit]

      [java]   StopAnalyzer:
      [java]     [processing] [bit]

      [java]   StandardAnalyzer:
      [java]     [processing-7-bit]

If you're using the StandardAnalyzer, "processing-7-bit" is indexed as
a single term, so the word "processing" is not indexed on its own at
all and a search for it will not match that text.  Grab the source code
from Lucene in Action at lucenebook.com and type "ant AnalyzerDemo" to
try out the basic analyzers.
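
To make the difference concrete without a Lucene install, here is a
minimal plain-Java sketch that only approximates two of the splitting
strategies shown in the output above: splitting on whitespace, and
splitting on any non-letter character with lowercasing. It is not the
AnalyzerDemo source and does not call Lucene; the class name
TokenizerSketch, its method names, and the abbreviated stop-word list
are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative approximation only; this is NOT Lucene code.
class TokenizerSketch {

    // Hypothetical, abbreviated stop-word list used only by this sketch.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Whitespace-style splitting: punctuation and digits stay inside the token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Letter-run splitting with lowercasing: any non-letter ends the current token.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Letter-run splitting followed by removal of the sketch's stop words.
    static List<String> stopFilteredTokens(String text) {
        List<String> tokens = letterTokens(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {"processing-7-bit", "straighforwerd.processing"};
        for (String text : samples) {
            System.out.println(text);
            System.out.println("  whitespace:    " + whitespaceTokens(text));
            System.out.println("  letters:       " + letterTokens(text));
            System.out.println("  stop-filtered: " + stopFilteredTokens(text));
        }
    }
}
```

With letter-run splitting, both sample strings yield a standalone
"processing" token, which is why a search for "processing" can match
them; with whitespace splitting, each string stays a single token and
an exact-term search for "processing" cannot.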

     Erik

