Posted to solr-user@lucene.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/11/24 13:24:57 UTC

RE: Lucene ancient greek normalization

If you are using Solr, you can configure your analysis chain to use the ICUFoldingFilterFactory (https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory) and then view the results on the Analysis screen of the Solr admin UI.
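
A minimal sketch of such an analysis chain in schema.xml might look like the following. The field type name text_greek_folded is only an illustration, and this assumes the ICU analysis jars (from the analysis-extras contrib) are on Solr's classpath:

```xml
<!-- Hypothetical field type; the name is illustrative, not from this thread. -->
<fieldType name="text_greek_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Fold case, accents and other diacritics -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer is applied at index and query time here, unaccented input and accented text should fold to the same terms.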

If you are in pure Lucene (circa version 4.8, some mods will be required depending on your version):
1) Extend Analyzer:
	@Override
	protected TokenStreamComponents createComponents(String field, Reader reader) {
		Tokenizer stream = new StandardTokenizer(version, reader);
		TokenFilter icu = new ICUFoldingFilter(stream);
		return new TokenStreamComponents(stream, icu);
	}

2) Then iterate through the tokens:

	TokenStream stream = analyzer.tokenStream("", new StringReader(text));
	stream.reset();
	CharTermAttribute cattr = stream.getAttribute(CharTermAttribute.class);
	while (stream.incrementToken()) {
		String token = cattr.toString();
...
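
If you just want a quick feel for what this kind of folding does without putting Lucene on the classpath, it can be roughly approximated with the JDK alone. This is only a sketch under assumptions: the class and method names are made up for illustration, the whitespace/punctuation split stands in for StandardTokenizer, and it is not what ICUFoldingFilter does internally (ICU applies many more foldings).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or ICU.
class GreekFoldSketch {

    // Approximate accent- and case-folding for a single token.
    static String fold(String token) {
        // NFD splits precomposed polytonic letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // Drop the combining marks (accents, breathings, diaeresis, iota subscript).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase, then map final sigma (U+03C2) to plain sigma (U+03C3).
        return stripped.toLowerCase(Locale.ROOT).replace('\u03C2', '\u03C3');
    }

    // Crude tokenizer: split on anything that is not a letter or a combining mark.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{M}]+")) {
            if (!raw.isEmpty()) {
                out.add(fold(raw));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumes the source file is compiled as UTF-8.
        System.out.println(tokens("Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος"));
    }
}
```

With this sketch, a folded query term and a folded indexed term compare equal, which is the effect the ICU filter gives you inside the analysis chain.
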
-----Original Message-----
From: paolo anghileri [mailto:paolo.anghileri@codegeneration.it] 
Sent: Saturday, November 22, 2014 11:41 AM
To: Allison, Timothy B.
Subject: Re: Lucene ancient greek normalization

Sorry Timothy for the beginner question, how did you manage to run this 
test?

Many thanks

Paolo

On 21/11/2014 21:14, Allison, Timothy B. wrote:
> ICU looks promising:
>
> Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->
>
> 1.μηνιν
> 2.αειδε
> 3.θεα
> 4.πηληιαδεω
> 5.αχιλληοσ
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
> Sent: Friday, November 21, 2014 3:08 PM
> To: dev@lucene.apache.org
> Subject: Re: Lucene ancient greek normalization
>
> Are you sure that's not something that's already addressed by the ICU
> Filter? http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html
>
> If you follow the links to what's possible, the page talks about
> Greek, though not ancient:
> http://userguide.icu-project.org/transforms/general#TOC-Greek
>
> There was also some discussion on:
> https://issues.apache.org/jira/browse/LUCENE-1343
>
> Regards,
>     Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 21 November 2014 14:14, paolo anghileri
> <pa...@codegeneration.it> wrote:
>> For development purposes I need the ability in lucene to normalize ancient
>> greek characters for all the cases of grammatical details such as accents,
>> diacritics and so on.
>>
>> My need is to retrieve ancient greek words with accents and other
>> grammatical details by the input of the string without accents.
>>
>> For example the input of οργανον (organon) should also retrieve Ὄργανον.
>>
>>
>> I am not a Lucene committer and I am new to this, so my question is about the
>> best practice to implement this in Lucene, and possibly submit a commit
>> proposal to the Lucene project management committee.
>>
>> I have made some searches and found this file in lucene-solr:
>>
>>
>> It contains normalization for some chars.
>> My thought would be to add extra normalization here, including all unicode
>> ancient greek chars with all grammatical details.
>> I already have all the unicode values for those chars, so it should not be
>> difficult for me to include them.
>>
>> If my understanding is correct, this should add to lucene the features
>> described above.
>>
>>
>> As I am new to this, my needs are:
>>
>>   - To be sure that this is the correct place in Lucene for doing normalization
>>   - How to post a commit proposal
>>
>>
>> Any help appreciated
>>
>> Kind regards
>>
>> Paolo
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org