Posted to dev@lucene.apache.org by paolo anghileri <pa...@codegeneration.it> on 2014/11/21 20:14:47 UTC

Lucene ancient greek normalization

For development purposes I need the ability in Lucene to normalize 
ancient Greek characters in all their forms, covering grammatical details 
such as accents, diacritics and so on.

My need is to retrieve ancient Greek words with accents and other 
grammatical details when the input string has no accents.

For example, the input οργανον (organon) should also retrieve Ὄργανον.


I am not a Lucene committer and I am new to this, so my question is about 
the best practice for implementing this in Lucene, and possibly submitting a 
commit proposal to the Lucene project management committee.

I have made some searches and found this file in lucene-solr:


It contains normalization for some characters.
My thought would be to add extra normalization here, including all 
Unicode ancient Greek characters with all grammatical details.
I already have all the Unicode values for those characters, so it should 
not be difficult for me to include them.

If my understanding is correct, this should add to lucene the features 
described above.


As I am new to this, my needs are:

 1. To be sure that this is the correct place in Lucene for doing
    normalization
 2. How to post a commit proposal


Any help appreciated

Kind regards

Paolo

Re: Lucene ancient greek normalization

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

On 21 November 2014 16:10, paolo anghileri
<pa...@codegeneration.it> wrote:
> The need is being able to search with simple strings without grammatical
> details and retrieve data with grammatical details.

I am pretty sure that this is what I did for a Thai demo. Actually, I
went another two steps and converted Thai to English transliteration
and then broadened phonetically. With Solr, in my case:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L35

So to me, the specific question would be whether Ancient Greek -
specifically - is present in the Unicode mapping tables, not the rest
of it.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Lucene ancient greek normalization

Posted by paolo anghileri <pa...@codegeneration.it>.

Many thanks Alex,

For clarity, let me explain a bit of what I would like to do:
I'd like to use MediaWiki as a base for this project.
The need is to be able to search with simple strings without grammatical 
details and retrieve data with grammatical details.
For that, I am evaluating a MediaWiki extension called CirrusSearch.
CirrusSearch depends on Elasticsearch, which in turn depends on 
Lucene.

CirrusSearch (and its dependencies) is used, for instance, by the Modern 
Greek Wiktionary, and works correctly for Modern Greek grammatical details.

In this case, if you input αλφα it will also retrieve άλφα,

but in the case of ancient Greek, οργανον will not retrieve Ὄργανον, 
since its grammatical details are specific to ancient Greek and do not 
appear to be supported.

Since this kind of Wikipedia search is ultimately based on Lucene, adding 
this feature to Lucene would potentially make it available to 
Wikimedia as well.

As Tim remarks in the following message, it seems that ICU is able to 
support this.

I have to investigate a little more, and check whether CirrusSearch 
uses ICU.

About the third link you are providing:

https://issues.apache.org/jira/browse/LUCENE-1343

It seems that the first file I pointed to:

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java

does something similar but specialized for Greek. This source also converts 
some diacritics, but lacks many other characters.
At first, my idea was to add extra normalization here.
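
Before extending that table by hand, it may be worth checking how the ancient Greek (polytonic) characters decompose under Unicode normalization. The sketch below uses only the JDK's java.text.Normalizer, not Lucene or ICU, and the class and helper names are made up for illustration. It lists the code points of the canonical decomposition (NFD) of Ὄ:

```java
import java.text.Normalizer;

final class GreekDecompositionSketch {

    // Hypothetical helper: list the code points of the NFD form of a string.
    static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F4C is GREEK CAPITAL LETTER OMICRON WITH PSILI AND OXIA.
        System.out.println(nfdCodePoints("\u1F4C")); // U+039F U+0313 U+0301
    }
}
```

The precomposed letter splits into a plain capital omicron followed by two combining marks, so removing the marks after decomposition may be enough, without listing every accented character in a table.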

I'll do some more searching next week, both in the Lucene and 
CirrusSearch docs, and I'll let you know.

Thanks to you and Tim for taking time on this

Regards

Paolo

On 21/11/2014 21:07, Alexandre Rafalovitch wrote:
> Are you sure that's not something that's already addressed by the ICU
> Filter? http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html
>
> If you follow the links to what's possible, the page talks about
> Greek, though not ancient:
> http://userguide.icu-project.org/transforms/general#TOC-Greek
>
> There was also some discussion on:
> https://issues.apache.org/jira/browse/LUCENE-1343

RE: Lucene ancient greek normalization

Posted by "Allison, Timothy B." <ta...@mitre.org>.

If you are using Solr, you can configure your analysis chain to use the ICUFoldingFilterFactory (https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory) and then view the results in the solr admin window.

If you are in pure Lucene (circa version 4.8, some mods will be required depending on your version):
1) Extend Analyzer:
	@Override
	protected TokenStreamComponents createComponents(String field, Reader reader) {
		Tokenizer stream = new StandardTokenizer(version, reader);
		TokenFilter icu = new ICUFoldingFilter(stream);
		return new TokenStreamComponents(stream, icu);
	}

2) Then iterate through the tokens:
	TokenStream stream = analyzer.tokenStream("", new StringReader(text));
	stream.reset();
	CharTermAttribute cattr = stream.getAttribute(CharTermAttribute.class);
	while (stream.incrementToken()) {
		String token = cattr.toString();
		...

-----Original Message-----
From: paolo anghileri [mailto:paolo.anghileri@codegeneration.it] 
Sent: Saturday, November 22, 2014 11:41 AM
To: Allison, Timothy B.
Subject: Re: Lucene ancient greek normalization

Sorry Timothy for the beginner question, how did you manage to run this 
test?

Many thanks

Paolo

On 21/11/2014 21:14, Allison, Timothy B. wrote:
> ICU looks promising:
>
> Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->
>
> 1.μηνιν
> 2.αειδε
> 3.θεα
> 4.πηληιαδεω
> 5.αχιλληοσ

RE: Lucene ancient greek normalization

Posted by "Allison, Timothy B." <ta...@mitre.org>.

ICU looks promising:

Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->

1.μηνιν
2.αειδε
3.θεα
4.πηληιαδεω
5.αχιλληοσ
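
For readers without Lucene on the classpath, the effect shown above can be approximated with the JDK alone. This is a minimal sketch, not the ICU folding algorithm: it assumes that decomposing to NFD, dropping combining marks, lowercasing, and mapping final sigma to medial sigma is enough for this input. The class name, the helper names, and the simple split on non-letters are all hypothetical stand-ins for a real tokenizer and filter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class GreekFoldSketch {

    // Hypothetical helper: fold one token to an unaccented, lowercase form.
    static String fold(String token) {
        // NFD splits precomposed letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // Drop the combining marks (accents, breathings, diaeresis, iota subscript).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase, then map final sigma (U+03C2) to medial sigma (U+03C3).
        return stripped.toLowerCase(Locale.ROOT).replace('\u03C2', '\u03C3');
    }

    // Hypothetical helper: split on anything that is not a letter or mark, then fold.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{M}]+")) {
            if (!t.isEmpty()) {
                out.add(fold(t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the folded tokens of the sample line.
        System.out.println(tokens("Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος"));
        // Compares the folded forms of the accented and unaccented spellings.
        System.out.println(fold("Ὄργανον").equals(fold("οργανον")));
    }
}
```

Under these assumptions the accented Ὄργανον and the plain οργανον fold to the same string, which is the matching behaviour asked for at the start of the thread. ICUFoldingFilter applies many more foldings than this, so its output can differ on other input.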

Re: Lucene ancient greek normalization

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Are you sure that's not something that's already addressed by the ICU
Filter? http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html

If you follow the links to what's possible, the page talks about
Greek, though not ancient:
http://userguide.icu-project.org/transforms/general#TOC-Greek

There was also some discussion on:
https://issues.apache.org/jira/browse/LUCENE-1343

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


Re: Lucene ancient greek normalization

Posted by paolo anghileri <pa...@codegeneration.it>.

Sorry, I forgot to add the link to the Lucene file:

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java
