Posted to dev@lucene.apache.org by paolo anghileri <pa...@codegeneration.it> on 2014/11/21 20:14:47 UTC
Lucene ancient greek normalization
For development purposes I need Lucene to be able to normalize Ancient
Greek characters in all their grammatical variants, such as accents,
diacritics and so on.
I need to retrieve Ancient Greek words that carry accents and other
grammatical details when the input string is typed without accents.
For example, the input οργανον (organon) should also retrieve Ὄργανον.
I am not a Lucene committer and I am new to this, so my question is about
the best practice for implementing this in Lucene, and possibly submitting
a commit proposal to the Lucene project management committee.
I did some searching and found this file in lucene-solr:
It contains normalization for some characters.
My idea would be to add extra normalization there, covering all the
Unicode Ancient Greek characters with all their grammatical details.
I already have the Unicode values for those characters, so it should not
be difficult for me to include them.
If my understanding is correct, this would add the feature described
above to Lucene.
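A table like the one proposed above would not necessarily have to be typed by hand. The following is only a sketch under stated assumptions, not Lucene code: it uses java.text.Normalizer from the JDK to derive, for every letter in the Unicode Greek Extended block (U+1F00 to U+1FFF), the lowercase base letter from its canonical decomposition. The class name GreekExtendedTableSketch and the method buildTable are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: derives a folding table
// for the Greek Extended block from Unicode canonical decompositions.
final class GreekExtendedTableSketch {

    // Maps each letter in U+1F00..U+1FFF to its lowercase base letter.
    static Map<Integer, Integer> buildTable() {
        Map<Integer, Integer> table = new TreeMap<>();
        for (int cp = 0x1F00; cp <= 0x1FFF; cp++) {
            // Skip unassigned code points and stand-alone accent characters.
            if (!Character.isLetter(cp)) {
                continue;
            }
            String nfd = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFD);
            // After NFD the base letter comes first, then the combining marks.
            int base = Character.toLowerCase(nfd.codePointAt(0));
            table.put(cp, base);
        }
        return table;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Integer> e : buildTable().entrySet()) {
            System.out.printf("U+%04X -> U+%04X%n", e.getKey(), e.getValue());
        }
    }
}
```

With such a table, U+1F4C (the initial letter of Ὄργανον) maps to U+03BF (ο). Whether a generated table or an existing filter is the better fit for Lucene is a separate question.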
As I am new to this, my needs are:
1. To be sure that this is the correct place in Lucene for doing
normalization
2. To know how to submit a commit proposal
Any help appreciated
Kind regards
Paolo
Re: Lucene ancient greek normalization
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On 21 November 2014 16:10, paolo anghileri
<pa...@codegeneration.it> wrote:
> The need is being able to search with simple strings without grammatical
> details and retrieve data with grammatical details.
I am pretty sure that this is what I did for a Thai demo. Actually, I
went another two steps: I converted Thai to an English transliteration
and then broadened it phonetically. With Solr, in my case:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L35
So to me, the specific question would be whether Ancient Greek -
specifically - is present in the Unicode mapping tables, not the rest
of it.
Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Lucene ancient greek normalization
Posted by paolo anghileri <pa...@codegeneration.it>.
Many thanks Alex,
For clarity, let me explain a bit more about what I would like to do:
I'd like to use MediaWiki as a base for this project.
The need is to be able to search with simple strings without grammatical
details and retrieve data with grammatical details.
For that, I am evaluating a MediaWiki extension called CirrusSearch.
CirrusSearch depends on Elasticsearch, and Elasticsearch in turn depends
on Lucene.
CirrusSearch (and its dependencies) is used, for instance, by the Modern
Greek Wiktionary, and works correctly for Modern Greek grammatical details.
In that case, if you input αλφα it will also retrieve άλφα,
but in the case of Ancient Greek, οργανον will not retrieve Ὄργανον,
since its grammatical details are specific to Ancient Greek and do not
appear to be supported.
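One possible reason for the difference, offered here as an assumption rather than something verified against CirrusSearch: the accented Modern Greek letters live in the Unicode "Greek and Coptic" block, while the Ancient Greek letters with breathings and accents live in the separate "Greek Extended" block, so a filter written for the first block would not touch the second. The small sketch below only shows which block each example character belongs to. The class name GreekBlockSketch and the method blockOf are hypothetical.

```java
// Hypothetical helper for illustration: reports the Unicode block
// of the accented characters used in the two examples above.
final class GreekBlockSketch {

    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // U+03AC: small alpha with tonos, as in the Modern Greek example.
        System.out.println(blockOf(0x03AC));
        // U+1F4C: capital omicron with psili and oxia, as in the Ancient Greek example.
        System.out.println(blockOf(0x1F4C));
    }
}
```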
Since this kind of Wikipedia search is ultimately based on Lucene, adding
this feature to Lucene would potentially make it available to Wikimedia
as well.
As Tim remarks in the following message, it seems that ICU is able to
support this.
I have to investigate this a little more, and check whether CirrusSearch
uses ICU.
About the third link you provided:
https://issues.apache.org/jira/browse/LUCENE-1343
It seems that the file I first pointed to:
https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java
does something similar, but specialized for Greek. That source also
converts some diacritics, but is missing many other characters.
Initially, my idea was to add extra normalization there.
I'll do some more searching next week, both in the Lucene and the
CirrusSearch docs, and I'll let you know.
Thanks to you and Tim for taking time on this
Regards
Paolo
RE: Lucene ancient greek normalization
Posted by "Allison, Timothy B." <ta...@mitre.org>.
If you are using Solr, you can configure your analysis chain to use the ICUFoldingFilterFactory (https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory) and then view the results in the Solr admin window.
If you are in pure Lucene (circa version 4.8, some mods will be required depending on your version):
1) Extend Analyzer:
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        // version: a Version constant matching your release, e.g. Version.LUCENE_48
        Tokenizer stream = new StandardTokenizer(version, reader);
        // ICU folding is applied to every token the tokenizer emits
        TokenFilter icu = new ICUFoldingFilter(stream);
        return new TokenStreamComponents(stream, icu);
    }
2) Then iterate through the tokens:
    TokenStream stream = analyzer.tokenStream("", new StringReader(text));
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
    stream.reset();  // must be called before the first incrementToken()
    while (stream.incrementToken()) {
        String token = cattr.toString();
        ...
    }
    stream.end();
    stream.close();
-----Original Message-----
From: paolo anghileri [mailto:paolo.anghileri@codegeneration.it]
Sent: Saturday, November 22, 2014 11:41 AM
To: Allison, Timothy B.
Subject: Re: Lucene ancient greek normalization
Sorry Timothy for the beginner question, how did you manage to run this
test?
Many thanks
Paolo
RE: Lucene ancient greek normalization
Posted by "Allison, Timothy B." <ta...@mitre.org>.
ICU looks promising:
Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->
1.μηνιν
2.αειδε
3.θεα
4.πηληιαδεω
5.αχιλληοσ
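For readers who want to see roughly what kind of folding produces the list above without setting up Lucene, here is a minimal sketch that uses only the JDK. It is not ICU and not ICUFoldingFilter, and it covers far fewer cases. It assumes that canonical decomposition (NFD), removal of combining marks, lowercasing, and replacing final sigma with medial sigma are enough for this example. The class name GreekFoldSketch and its methods are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration only; real projects would
// rely on ICUFoldingFilter, which handles many more cases.
final class GreekFoldSketch {

    // Fold one string: decompose, drop combining marks, lowercase, unify sigma.
    static String fold(String s) {
        // NFD splits e.g. U+1F4C into omicron + breathing + accent.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches the combining marks that NFD has separated out.
        String bare = decomposed.replaceAll("\\p{M}+", "");
        // Final sigma (U+03C2) becomes medial sigma (U+03C3), as in the list above.
        return bare.toLowerCase(Locale.ROOT).replace('\u03C2', '\u03C3');
    }

    // Split on anything that is neither a letter nor a mark, then fold each piece.
    static List<String> foldTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{M}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(fold(piece));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(foldTokens("Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος"));
    }
}
```

For the line quoted above this should yield the same five tokens, and fold("Ὄργανον") should give οργανον. The whitespace-and-punctuation split is a simplification and is not equivalent to StandardTokenizer.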
Re: Lucene ancient greek normalization
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Are you sure that's not something that's already addressed by the ICU
Filter? http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html
If you follow the links to what's possible, the page talks about
Greek, though not ancient:
http://userguide.icu-project.org/transforms/general#TOC-Greek
There was also some discussion on:
https://issues.apache.org/jira/browse/LUCENE-1343
Regards,
Alex.
Re: Lucene ancient greek normalization
Posted by paolo anghileri <pa...@codegeneration.it>.
Sorry, I forgot to add the link to the Lucene file:
https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java