Posted to user@lucenenet.apache.org by Eric Svensson <sa...@gmail.com> on 2010/02/25 09:25:16 UTC

Error in the StandardAnalyzer?

Hi,



I’ve noticed some odd tokenization by the StandardAnalyzer when the input
contains a slash (‘/’).

Below are seven examples; I think the output is wrong in two of them.



1. ab -> [ab]

2. /ab -> [ab]

3. a/b -> [b]

4. ab/ -> [ab]

5. aa/b -> [aa] [b]

6. a/bb -> [bb]

7. aa/bb -> [aa] [bb]



Can someone explain to me why example 3 is returning [b] and not [ab], and
why example 6 is returning [bb] instead of [a] [bb]?



Furthermore, I would like to know how to make the slash character searchable,
that is, how to stop the StandardAnalyzer from removing it during tokenization.



Thanks for listening!

RE: Error in the StandardAnalyzer?

Posted by Michael Garski <mg...@myspace-inc.com>.
Actually, I misspoke earlier about the StandardAnalyzer's handling of punctuation: some punctuation is retained, specifically periods within URLs.

http://lucene.apache.org/lucene.net/

tokenizes into three terms: http, lucene.apache.org, lucene.net

Michael
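
The period-retention behavior described above can be illustrated with a small model. This is a Python sketch, not Lucene.NET code, and it does not reproduce the StandardAnalyzer's actual grammar. It assumes only one rule: runs of letters and digits joined by single periods stay together, and every other character separates tokens. The names TOKEN and tokenize are hypothetical.

```python
import re

# Assumption: letters/digits joined by single periods form one token;
# any other character (slash, colon, trailing period, ...) is a separator.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def tokenize(text):
    """Return lowercased tokens, keeping periods that sit between alphanumerics."""
    return TOKEN.findall(text.lower())


if __name__ == "__main__":
    print(tokenize("http://lucene.apache.org/lucene.net/"))
    print(tokenize("aa/bb"))
```

Under that assumption the URL yields the same three terms listed above, while a slash still splits "aa/bb" into two tokens.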

-----Original Message-----
From: Michael Garski 
Sent: Thursday, February 25, 2010 7:37 AM
To: lucene-net-user@lucene.apache.org
Subject: Re: Error in the StandardAnalyzer?

The standard analyzer removes all non-alphanumeric characters. If you  
need to retain them you'll need to use a different analyzer.

Michael




Re: Error in the StandardAnalyzer?

Posted by Michael Garski <mg...@myspace-inc.com>.
The standard analyzer removes all non-alphanumeric characters. If you  
need to retain them you'll need to use a different analyzer.

Michael
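
One candidate for "a different analyzer" is one that splits only on whitespace, such as Lucene's WhitespaceAnalyzer, which also skips lowercasing and stop-word removal. Whether it fits this use case is not settled in the thread. The Python sketch below only models the idea with a toy inverted index; it is not Lucene.NET code, and whitespace_tokens, build_index and search are hypothetical names.

```python
def whitespace_tokens(text):
    """Split on whitespace only, so characters such as '/' stay inside tokens."""
    return text.split()


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in whitespace_tokens(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every token of the query."""
    result = None
    for token in whitespace_tokens(query):
        ids = index.get(token, set())
        result = ids if result is None else result & ids
    return sorted(result or [])


if __name__ == "__main__":
    # Assumption: three made-up documents, keyed by id.
    index = build_index({1: "a/b", 2: "aa/bb", 3: "a b"})
    print(search(index, "a/b"))
    print(search(index, "a"))
```

Note the trade-off in this model: "a/b" becomes searchable as a whole token, but a query for "a" alone no longer matches the document containing "a/b", and matching is case-sensitive. The query text has to go through the same tokenization as the indexed text.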



Re: Error in the StandardAnalyzer?

Posted by Eric Svensson <sa...@gmail.com>.
>
> Is it because 'a' is a stop word? Does it return anything from 'a'?
>
D'oh! You're of course right. I didn't give the stop words any thought at
all. Thanks.



Do you have any suggestions for how to make the slash searchable?

RE: Error in the StandardAnalyzer?

Posted by Hugh Spiller <Hu...@Renishaw.com>.
Is it because 'a' is a stop word? Does it return anything from 'a'?

________________________________

Hugh Spiller 
Internet Development Team 
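
The stop-word explanation can be checked with a small model. The Python sketch below is not Lucene.NET code and does not reproduce the StandardAnalyzer's real grammar. It assumes just three steps (split on non-alphanumeric characters, lowercase, drop stop words) and an abbreviated stop list chosen for illustration. The names STOP_WORDS and analyze are hypothetical.

```python
import re

# Assumption: a short illustrative subset of English stop words.
# Only "a" matters for the examples in this thread.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def analyze(text):
    """Simplified model: split on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    for text in ["ab", "/ab", "a/b", "ab/", "aa/b", "a/bb", "aa/bb"]:
        print(text, "->", analyze(text))
```

Under those assumptions the sketch gives the same seven results as the original post: in examples 3 and 6 the slash does split off 'a' as its own token, and that token is then discarded as a stop word.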

