Posted to java-user@lucene.apache.org by OBender <os...@hotmail.com> on 2009/07/13 20:35:33 UTC

strange issues with IRISH

Hi All,

 

I've come across a very strange issue with the Irish language.

I have the following set of strings in Irish:

 

ag an gcrosbhealach seo, 

Lean ar an mórbhealach., 

Lean an bóthar seo., 

An bhfuil ... in am imeacht?, 

An ... sin an t-am ceart?

 

And here is a search string: an

 

Search returns nothing instead of all of those phrases. I'm using the simple
analyzer but suspect that [an] is still ignored as a stop word for some
reason.

I've tried a custom analyzer with the following code:

 

TokenStream ts = new WhitespaceTokenizer(reader);

ts = new LowerCaseFilter(ts);

return ts;

 

with no luck.

 

Any ideas?

 

Thanks.

Re: strange issues with IRISH

Posted by John Byrne <jo...@propylon.com>.

Hi,

"suspect that [an] is still ignored as a stop word for some reason"

Yes, "an" is still a stop word in English of course! (e.g. 'an apple')

Your custom analyzer should work; are you making sure to do both your 
indexing *and* your searching with the new analyzer?
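
To illustrate why the analyzer has to match on both sides, here is a minimal sketch in plain Java. It does not use the Lucene API at all: the class IrishStopWordDemo, its methods, and the small ENGLISH_STOP_WORDS set are hypothetical names made up for this illustration, and the tokenizer only imitates the whitespace-plus-lowercase chain from the question. It assumes a toy in-memory inverted index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IrishStopWordDemo {
    // Hypothetical, tiny English stop list; for illustration only.
    static final Set<String> ENGLISH_STOP_WORDS = Set.of("a", "an", "the", "in", "is");

    // The example strings from the original question.
    static final List<String> DOCS = List.of(
            "ag an gcrosbhealach seo",
            "Lean ar an mórbhealach.",
            "Lean an bóthar seo.",
            "An bhfuil ... in am imeacht?",
            "An ... sin an t-am ceart?");

    // Split on whitespace, lowercase, and drop anything in stopWords.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Build a toy inverted index: token -> ids of the documents containing it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs, Set<String> stopWords) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : tokenize(docs.get(id), stopWords)) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Look up every query token and return the union of matching document ids.
    static Set<Integer> search(Map<String, Set<Integer>> index, String query, Set<String> stopWords) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : tokenize(query, stopWords)) {
            hits.addAll(index.getOrDefault(token, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Index built with stop words removed, query analyzed without removal.
        Map<String, Set<Integer>> stopped = buildIndex(DOCS, ENGLISH_STOP_WORDS);
        System.out.println(search(stopped, "an", Set.of()));

        // Same stop-word-free analysis at index time and at query time.
        Map<String, Set<Integer>> plain = buildIndex(DOCS, Set.of());
        System.out.println(search(plain, "an", Set.of()));
    }
}
```

In this sketch the first search finds nothing, because "an" was never written to the index, while the second finds all five strings. If your real index was built before you switched analyzers, it would need to be rebuilt for the same reason.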

I think making a list of Irish stop words could be tricky, since "an" 
sometimes means "the", but sometimes forms part of a verb (e.g. "an 
bhfuil...?")

The safest bet is probably not to bother removing stop words. These days 
it doesn't really affect performance much, storage space is generally not 
much of an issue, and it makes phrase searching more accurate if you 
keep them.

-John