Posted to dev@lucene.apache.org by Karel Tejnora <ka...@tejnora.cz> on 2006/01/11 17:50:58 UTC

Analyzers, perfect hash, ICU

Hi all,
    I'm working on an analyzer for the Slavic languages written in
Latin script (cs, sk), without stemming at first.
I would like to ask you:
Stop-word analyzers often use a HashSet implementation, but the
stop words rarely (if ever) change from the ones shipped in the Java
code. Do you think there is any benefit in using a perfect hash
algorithm?
I will also write an ICU analyzer for Latin characters (decomposing
them and returning the base character). Do you have any experience with
ICU (icu.sf.net), e.g. problems or bottlenecks?

Thx,
Karel

P.S.: I would also like to contribute this work to lucene-contrib if
it is considered useful. Is there a how-to for setting up Eclipse for
Lucene/Apache-related projects?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Analyzers, perfect hash, ICU

Posted by Ken Krugler <kk...@transpac.com>.
>Hi all,
>    I'm working on an analyzer for the Slavic languages written in
>Latin script (cs, sk), without stemming at first.
>I would like to ask you:
>Stop-word analyzers often use a HashSet implementation, but the
>stop words rarely (if ever) change from the ones shipped in the Java
>code. Do you think there is any benefit in using a perfect hash
>algorithm?

My guess is that you wouldn't save much time here using a perfect hash.
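
To make the comparison concrete, here is a minimal sketch of the HashSet
lookup in question. The class name, the method name, and the word list are
made up for illustration; the list is not one shipped with Lucene. Each
lookup costs one hash of the term plus an equals() check on a match. A
perfect hash would still have to hash the term and compare it against the
stored word to reject non-stop-words, so there is little left to save.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Lucene code.
final class StopWordSketch {
    // Illustrative, hand-picked Czech words; built once and never modified.
    static final Set<String> STOP_WORDS = Collections.unmodifiableSet(
        new HashSet<String>(Arrays.asList("a", "i", "ale", "nebo", "se")));

    // One hash computation plus at most a few equals() calls per term.
    static boolean isStopWord(String term) {
        return STOP_WORDS.contains(term);
    }
}
```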

>I will also write an ICU analyzer for Latin characters (decomposing
>them and returning the base character). Do you have any experience with
>ICU (icu.sf.net), e.g. problems or bottlenecks?

This could be a significant performance hit. Using ICU is a good
idea, but putting a simple filter in front of it can typically
save you a lot of time.

E.g. if there are a lot of characters that don't require any 
decomposition, you could do some quick (and very conservative) checks 
to skip calls to ICU.
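
A rough sketch of such a front-end check, under some assumptions: it uses
the JDK's java.text.Normalizer as a stand-in for ICU's normalizer so that
the example is self-contained, and the class and method names are made up.
The fast path is deliberately conservative: a term made only of ASCII
characters has nothing to decompose, so it is returned untouched.

```java
import java.text.Normalizer;

// Hypothetical sketch; java.text.Normalizer stands in for ICU here.
final class AccentFoldSketch {
    static String foldToBase(String term) {
        // Fast path: pure-ASCII terms contain no decomposable characters
        // and no combining marks, so skip normalization entirely.
        boolean needsWork = false;
        for (int i = 0; i < term.length(); i++) {
            if (term.charAt(i) >= 0x80) {
                needsWork = true;
                break;
            }
        }
        if (!needsWork) {
            return term;
        }
        // Slow path: canonical decomposition, then drop combining marks
        // so that only the base characters remain.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, an ASCII term such as "lucene" never reaches the
normalizer, while a Czech term with diacritics takes the slow path.
Whether the check pays off depends on how much of the input is plain ASCII.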

But of course, measure then optimize :)

>P.S.: I would also like to contribute this work to lucene-contrib if
>it is considered useful. Is there a how-to for setting up Eclipse for
>Lucene/Apache-related projects?

If you're asking about how to set up Eclipse to do development for 
Lucene, I found some posts to the mailing list a while back, but 
nothing definitive.

FWIW, my experience w/Eclipse 3.1 was that trying to auto-create 
Eclipse projects using the Ant build file didn't work very well. So 
we wound up manually creating the project, setting up the classpath, 
etc.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200
