
Posted to dev@jena.apache.org by Alexis Miara <al...@hotmail.com> on 2015/05/20 17:40:36 UTC

Jena-text multilingual implementation

Hi,
This proposal aims to integrate language-specific support in jena-text.
It summarizes the changes (and several discussions) made in https://github.com/apache/jena/pull/64 (JENA-928) and previously in https://github.com/apache/jena/pull/52. The forked branch is available at https://github.com/LICEF/jena/tree/jena-text-ml-single-index. A single patch file is also attached.

Below are the changes and new features:

1) LocalizedAnalyzer
A new analyzer can now be specified (for the indexing or the query phase) to take advantage of Lucene's language-specific analyzers (stemming, stop words, ...). Like the other existing analyzers (SimpleAnalyzer, KeywordAnalyzer, ...), it can be used in assembler specifications, together with the related language:

    text:queryAnalyzer [
        a text:LocalizedAnalyzer ;
        text:language "en"
    ]

In Java code, it can be instantiated with the static getLocalizedAnalyzer(lang) method of the org.apache.jena.query.text.analyzer.Util class.
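
As a rough illustration of what a per-language lookup of this kind involves, here is a minimal Java sketch. It is not the jena-text code: FakeAnalyzer is a stand-in for Lucene's Analyzer so that the example compiles without Lucene, and the class name, the two registered languages, the fallback to a default analyzer and the stripping of region subtags ("en-GB" to "en") are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical sketch of a language-tag to analyzer lookup (not the jena-text implementation). */
final class LocalizedAnalyzerLookupSketch {

    /** Stand-in for Lucene's Analyzer, so that the sketch compiles without Lucene. */
    static final class FakeAnalyzer {
        final String name;
        FakeAnalyzer(String name) { this.name = name; }
    }

    // Assumption: unknown or missing languages fall back to one default analyzer.
    private static final FakeAnalyzer DEFAULT = new FakeAnalyzer("standard");

    // Assumption: a small registry of factories keyed by primary language subtag.
    private static final Map<String, Supplier<FakeAnalyzer>> FACTORIES = new HashMap<>();

    // Analyzers are created lazily and reused, one instance per language.
    private static final Map<String, FakeAnalyzer> CACHE = new ConcurrentHashMap<>();

    static {
        FACTORIES.put("en", () -> new FakeAnalyzer("english"));
        FACTORIES.put("fr", () -> new FakeAnalyzer("french"));
    }

    static FakeAnalyzer forLanguage(String lang) {
        if (lang == null || lang.isEmpty()) {
            return DEFAULT;
        }
        String tag = lang.toLowerCase(Locale.ROOT);
        int dash = tag.indexOf('-');
        // Keep only the primary subtag: "en-gb" becomes "en".
        String primary = dash > 0 ? tag.substring(0, dash) : tag;
        Supplier<FakeAnalyzer> factory = FACTORIES.get(primary);
        if (factory == null) {
            return DEFAULT;
        }
        return CACHE.computeIfAbsent(primary, k -> factory.get());
    }

    public static void main(String[] args) {
        System.out.println(forLanguage("en-GB").name); // prints "english"
        System.out.println(forLanguage("de").name);    // prints "standard"
    }
}
```

Caching one instance per language keeps the lookup cheap when it is called for every literal during indexing.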

2) TextIndexLuceneMultilingual
This new subclass of TextIndexLucene dynamically selects the right localized analyzer depending on the literal's language. The selected analyzer is used for both indexing and querying the index. The language is also added to the index by default.
To enable multilingual support, just set the following option in the index assembler spec:

    <#indexLucene> a text:TextIndexLucene ;
        text:directory "mem" ;
        text:multilingualSupport true ;
        .

3) Explicit language field in the index
Even when linguistic analyzers are not needed, the literals' languages can be stored in the index to extend query capabilities. To do so, set the new langField parameter in the EntityMap assembler:

    <#entMap> a text:EntityMap ;
        text:entityField "uri" ;
        text:defaultField "text" ;
        text:langField "lang" ;
        .

4) Usage
Once langField is present in the index, use clauses like the following in SPARQL queries to take it into account:

    ?s text:query (rdfs:label 'word' 'lang:en')    # target English literals
    ?s text:query (rdfs:label 'word' 'lang:none')  # target unlocalized literals
    ?s text:query (rdfs:label 'word')              # ignore language

The "lang:xx" parameter is removed from the argument list before the objectToStruct processing to avoid possible conflicts.
Extra parameters should be generalized in the same manner, e.g. "limit:10", "score:x", ... This would allow parameters to be optional and would remove the order and size constraints.

5) Refactoring
To simplify the TextDatasetFactory class, the TextIndexConfig class has been introduced. It avoids having to add new methods for each new parameter. This class provides a setter for each desired variable.
EntityDefinition has changed in the same way.
Example code and unit tests have been changed accordingly. However, the old methods could be reintroduced for backward compatibility.

Alexis Miara
Analyst Programmer
Centre de recherche LICEF
Télé-université (TÉLUQ)
Montréal (Québec), Canada
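
To illustrate the "lang:xx" handling described under 4) Usage, here is a minimal Java sketch of pulling the language token out of a text:query argument list before the remaining arguments are parsed. It is an assumption-laden illustration, not the code from the pull request: the arguments are modelled as plain strings rather than Jena nodes, and the LangArgSketch, Parsed and extractLang names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits a "lang:xx" token off a text:query argument list. */
final class LangArgSketch {

    /** The language restriction (or null) plus the arguments left for the usual parsing. */
    static final class Parsed {
        final String lang;            // e.g. "en" or "none"; null when no lang: token was given
        final List<String> remaining; // arguments with the lang: token removed, order preserved

        Parsed(String lang, List<String> remaining) {
            this.lang = lang;
            this.remaining = remaining;
        }
    }

    private static final String PREFIX = "lang:";

    static Parsed extractLang(List<String> args) {
        String lang = null;
        List<String> remaining = new ArrayList<>();
        for (String arg : args) {
            // Only the first non-empty "lang:xx" token is consumed; everything else passes through.
            if (lang == null && arg.startsWith(PREFIX) && arg.length() > PREFIX.length()) {
                lang = arg.substring(PREFIX.length());
            } else {
                remaining.add(arg);
            }
        }
        return new Parsed(lang, remaining);
    }

    public static void main(String[] argv) {
        Parsed p = extractLang(Arrays.asList("rdfs:label", "word", "lang:en"));
        System.out.println(p.lang + " " + p.remaining); // prints "en [rdfs:label, word]"
    }
}
```

Because the token is recognized by its prefix rather than by its position, the same approach could be extended to other optional parameters such as "limit:10" without imposing an order on the arguments.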

Re: Jena-text multilingual implementation

Posted by Osma Suominen <os...@helsinki.fi>.

20.05.2015, 18:40, Alexis Miara wrote:
> Hi,
> This proposal aims to integrate language-specific support in jena-text.

I've been coaching this along, as you can see on GitHub. I think Alex 
has done a lot of good work here and I'm in favor of merging this 
contribution.

Backward compatibility for text index configuration by Java code is 
potentially an issue, but I've understood that there is also other API 
churn going on at the moment with the Jena3 transition, so this might be 
a good moment to clean up the API and get rid of old methods.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)