You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Kalani Ruwanpathirana <ka...@gmail.com> on 2008/08/25 09:19:54 UTC

Standard Analyzer

Hi,

I am using StandardAnalyzer when creating the Lucene index. It indexes the
word "wo&rk" as it is but does not index the word "wo*rk" in that manner.
Can I index such words (including * and ?) as it is? Otherwise I have no way
to index and search for words like "wo*rk", you?, etc.

Thanks
-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa

Re: Standard Analyzer

Posted by Karl Wettin <ka...@gmail.com>.

25 aug 2008 kl. 11.14 skrev Kalani Ruwanpathirana:

> Hi,
>
> Thanks, I tried WhitespaceAnalyzer too, but it seems case sensitive.

Then you simply add a LowercaseFilter to the chain in the Analyzer:

public final class WhitespaceAnalyzer extends Analyzer {
   public TokenStream tokenStream(String fieldName, Reader reader) {
-    return new WhitespaceTokenizer(reader);
+    TokenStream ts = new WhitespaceTokenizer(reader);
+    ts = new LowercaseFilter(ts);
+    return ts;
   }


> If I need to search for words like "correct?", "<html>" (it escapes  
> <, > and
> another few characters too) I need to index those kind of words.

That sounds like an XY-problem to me:
http://www.perlmonks.org/index.pl?node_id=542341

What I really was asking about is what problem it is you are trying to  
solve by indexing and searching for these sort of tokens.


        karl


>
>
> On Mon, Aug 25, 2008 at 1:15 PM, Karl Wettin <ka...@gmail.com>  
> wrote:
>
>>
>> 25 aug 2008 kl. 09.19 skrev Kalani Ruwanpathirana:
>>
>> Hi,
>>>
>>> I am using StandardAnalyzer when creating the Lucene index. It  
>>> indexes the
>>> word "wo&rk" as it is but does not index the word "wo*rk" in that  
>>> manner.
>>> Can I index such words (including * and ?) as it is? Otherwise I  
>>> have no
>>> way
>>> to index and search for words like "wo*rk", you?, etc.
>>>
>>
>>
>> Try an alternative analyzer, perhaps WhitespaceAnalyzer?  
>> (StandardAnalyzer
>> will index wo&rk as a single term because it contains a rule to  
>> handle names
>> such as AT&T.)
>>
>> You should probably also explain why you need to create an index  
>> like this.
>>
>>
>>
>>       karl
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> -- 
> Kalani Ruwanpathirana
> Department of Computer Science & Engineering
> University of Moratuwa


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Standard Analyzer

Posted by Kalani Ruwanpathirana <ka...@gmail.com>.

Hi,

Thanks, I tried WhitespaceAnalyzer too, but it seems case sensitive.

If I need to search for words like "correct?", "<html>" (it escapes <, > and
another few characters too) I need to index those kind of words.

On Mon, Aug 25, 2008 at 1:15 PM, Karl Wettin <ka...@gmail.com> wrote:

>
> 25 aug 2008 kl. 09.19 skrev Kalani Ruwanpathirana:
>
>  Hi,
>>
>> I am using StandardAnalyzer when creating the Lucene index. It indexes the
>> word "wo&rk" as it is but does not index the word "wo*rk" in that manner.
>> Can I index such words (including * and ?) as it is? Otherwise I have no
>> way
>> to index and search for words like "wo*rk", you?, etc.
>>
>
>
> Try an alternative analyzer, perhaps WhitespaceAnalyzer? (StandardAnalyzer
> will index wo&rk as a single term because it contains a rule to handle names
> such as AT&T.)
>
> You should probably also explain why you need to create an index like this.
>
>
>
>        karl
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa

Re: Standard Analyzer

Posted by Karl Wettin <ka...@gmail.com>.

25 aug 2008 kl. 09.19 skrev Kalani Ruwanpathirana:

> Hi,
>
> I am using StandardAnalyzer when creating the Lucene index. It  
> indexes the
> word "wo&rk" as it is but does not index the word "wo*rk" in that  
> manner.
> Can I index such words (including * and ?) as it is? Otherwise I  
> have no way
> to index and search for words like "wo*rk", you?, etc.


Try an alternative analyzer, perhaps WhitespaceAnalyzer?  
(StandardAnalyzer will index wo&rk as a single term because it  
contains a rule to handle names such as AT&T.)

You should probably also explain why you need to create an index like  
this.



         karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org