Posted to oak-commits@jackrabbit.apache.org by to...@apache.org on 2017/10/30 09:06:55 UTC
svn commit: r1813736 - in
/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query: query.md search-mt.md
Author: tommaso
Date: Mon Oct 30 09:06:55 2017
New Revision: 1813736
URL: http://svn.apache.org/viewvc?rev=1813736&view=rev
Log:
OAK-4348 - added some documentation
Added:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/search-mt.md
Modified:
jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/query.md
Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/query.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/query.md?rev=1813736&r1=1813735&r2=1813736&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/query.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/query.md Mon Oct 30 09:06:55 2017
@@ -45,5 +45,6 @@ For more details on how indexing works (
### Customisations
* [Change Out-Of-The-Box Index Definitions](./ootb-index-change.html)
+* [Machine Translation for Search](./search-mt.html)
Added: jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/search-mt.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/search-mt.md?rev=1813736&view=auto
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/search-mt.md (added)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/search-mt.md Mon Oct 30 09:06:55 2017
@@ -0,0 +1,69 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+## Machine Translation for Search
+
+* [Query time MT for Lucene indexes](#qtmtl)
+ * [Apache Joshua](#joshua)
+ * [Language Packs](#languagepacks)
+ * [Setup](#setup)
+
+Oak supports CLIR (Cross Language Information Retrieval) by using _Machine Translation_ to decorate search queries.
+This extension is provided by the _oak-search-mt_ bundle.
+
+### <a name="qtmtl"></a> Query time MT for Lucene indexes
+
+Machine translation at query time is supported for Oak Lucene indexes by an extension of Oak Lucene's
+*FulltextQueryTermsProvider* API called *MTFulltextQueryTermsProvider*.
+The initial implementation details can be found in [OAK-4348](https://issues.apache.org/jira/browse/OAK-4348).
+
+The *MTFulltextQueryTermsProvider* takes the text of a given query, translates it when possible, and provides a new
+Lucene query (to be added to the original one).
+Query time machine translation is performed by the *MTFulltextQueryTermsProvider* only if the index definition of the
+selected index matches the node types defined in the *MTFulltextQueryTermsProvider* configuration (e.g. oak:Unstructured).
+
+The *MTFulltextQueryTermsProvider* first tries to translate the whole text and then the single
+tokens produced by the Lucene _Analyzer_ passed in the *#getQueryTerm(String text, Analyzer analyzer, NodeState indexDefinition)*
+API call.
+
+Machine Translation is currently implemented by means of Apache Joshua, a statistical machine translation toolkit.
+*MTFulltextQueryTermsProvider* requires a *language pack* (an SMT model) in order to translate search queries.
+
+#### <a name="joshua"></a> Apache Joshua
+
+Apache Joshua is a statistical machine translation toolkit originally developed at Johns Hopkins University and the University of
+Pennsylvania, donated in 2015 to the Apache Software Foundation.
+For more information on the usage of Apache Joshua for multilingual search see the slides/video from the Berlin Buzzwords 2017
+presentation [Embracing diversity: searching over multiple languages](https://berlinbuzzwords.de/17/session/embracing-diversity-searching-over-multiple-languages).
+
+##### <a name="languagepacks"></a> Language Packs
+
+Apache Joshua can be used to train machine translation models called _language packs_; however, it also provides a set
+of ready-to-use (Apache licensed) language packs for many language pairs at:
+
+[https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs](https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs)
+
+#### <a name="setup"></a> Setup
+
+Multiple *MTFulltextQueryTermsProvider* instances can be configured (for different language pairs) by using the *MTFulltextQueryTermsProviderFactory*
+OSGi configuration factory.
+In order to instantiate an *MTFulltextQueryTermsProviderFactory* the following properties need to be configured:
+
+ * _path.to.config_ -> the path to the _joshua.config_ configuration file (e.g. of a downloaded language pack)
+ * _node.types_ -> the list of node types for which query time MT expansion should be done
+ * _min.score_ -> the minimum score (between 0 and 1) for a translated sentence / token to be used while expanding the query (this is used to filter out low quality translations)
+
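As a reading aid for the documentation added above, here is a minimal Java sketch of the expansion order it describes: translate the whole query text first, then the single tokens, keeping only translations whose score reaches _min.score_. It is an illustration under stated assumptions, not the oak-search-mt code. The `Translator` and `Translation` types are hypothetical stand-ins for a Joshua decoder call, a whitespace split stands in for the Lucene _Analyzer_, and treating _min.score_ as an inclusive threshold is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch only, NOT the oak-search-mt implementation.
 * Translator and Translation are hypothetical types introduced for this example.
 */
class QueryExpansionSketch {

    /** Hypothetical result: translated text plus a score between 0 and 1. */
    public static final class Translation {
        final String text;
        final double score;

        public Translation(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Hypothetical translator; returns null when it has no translation. */
    public interface Translator {
        Translation translate(String source);
    }

    /** Collects translated terms: whole text first, then single tokens. */
    public static List<String> expand(String queryText, Translator translator, double minScore) {
        // LinkedHashSet keeps insertion order and drops duplicates
        // (e.g. a one-word query, where whole text and token coincide).
        Set<String> terms = new LinkedHashSet<>();

        // Step 1: try to translate the whole query text.
        addIfGoodEnough(terms, translator.translate(queryText), minScore);

        // Step 2: translate single tokens. A whitespace split is an
        // assumption standing in for the tokens of a Lucene Analyzer.
        for (String token : queryText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                addIfGoodEnough(terms, translator.translate(token), minScore);
            }
        }
        return new ArrayList<>(terms);
    }

    /** Keeps a translation only if its score reaches minScore (assumed inclusive). */
    private static void addIfGoodEnough(Set<String> terms, Translation t, double minScore) {
        if (t != null && t.score >= minScore) {
            terms.add(t.text);
        }
    }

    public static void main(String[] args) {
        // Toy translation table with made-up scores, for demonstration only.
        Map<String, Translation> table = new HashMap<>();
        table.put("hello world", new Translation("ciao mondo", 0.9));
        table.put("hello", new Translation("ciao", 0.8));
        table.put("world", new Translation("mondo", 0.3));

        // "mondo" scores below 0.5 and is filtered out.
        System.out.println(expand("hello world", table::get, 0.5)); // prints [ciao mondo, ciao]
    }
}
```

In an actual provider the collected terms would be turned into a Lucene query that is added to the original one, as the documentation states; the sketch stops at the list of terms to stay free of external dependencies.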