Posted to oak-dev@jackrabbit.apache.org by Lukas Kahwe Smith <sm...@pooteeweet.org> on 2016/07/12 08:15:22 UTC

multilingual content and indexing

Aloha,

I did a bit of searching but didn’t find anything specific on plans for dealing with multi-language content inside Oak. I am asking because indexing content from different languages together can lead to suboptimal sorting and needless overhead. Are there any plans to address this?

If not inside Oak, are there any projects on top of Oak (or inside AEM) that deal with this?

Or is this basically considered a case where one needs to plug in a custom indexer and figure it out on one’s own?

regards,
Lukas Kahwe Smith
smith@pooteeweet.org




Re: multilingual content and indexing

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Tue, Jul 12, 2016 at 3:53 PM, Lukas Kahwe Smith <sm...@pooteeweet.org> wrote:
>> Alternatively, you can create different index definitions for each subtree (see [1]), e.g. using the “includedPaths” property. This would lead to smaller indexes, with the downside that you would have to create a new index definition whenever you add a new language tree.

Another way would be to have your index definition under each node

/content/en/oak:index/fooIndex
/content/jp/oak:index/fooIndex

And have the analyzer in each index definition configured for that language.

Chetan Mehrotra

Re: multilingual content and indexing

Posted by Lukas Kahwe Smith <sm...@pooteeweet.org>.
> On 12 Jul 2016, at 12:15, Michael Marth <mm...@adobe.com> wrote:
> 
> Hi Lukas,
> 
> I am not entirely sure what you want to achieve (or what exactly you mean by “dealing with multi language content”), but trying to answer a bit:
> 
> Let’s say you have distinct content trees for different languages, e.g.
> /content/en
> /content/jp
> etc.
> 
> You can choose to index all these trees in one (Lucene) index for full text search and filter the results in your query, i.e. put the burden on the query engine.
> This is a simple setup which leads to a large index (although I personally have not seen this to be a problem).

For example, if you index multilingual content under the same field while doing monolingual searches, you tend to get suboptimal sorting, since the word distribution values of one language affect those of another.
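
To make the word-distribution point concrete, here is a minimal sketch, assuming the textbook inverse document frequency ln(N / df). Lucene's similarity implementations use smoothed variants of this formula, and the document counts below are hypothetical, chosen only for illustration.

```java
public class MixedCorpusIdf {
    // Textbook inverse document frequency: ln(N / df).
    // Assumption: this is a simplification of what Lucene actually computes.
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // Hypothetical corpus sizes, not taken from any real repository.
        int englishDocs = 1_000;
        int japaneseDocs = 9_000;
        // An English word that occurs in half of the English documents
        // and in none of the Japanese ones.
        int docsWithTerm = 500;

        double perLanguage = idf(englishDocs, docsWithTerm);
        double mixed = idf(englishDocs + japaneseDocs, docsWithTerm);

        System.out.printf("English-only index: idf = %.3f%n", perLanguage);
        System.out.printf("Mixed index:        idf = %.3f%n", mixed);
    }
}
```

With these numbers the English-only value is about 0.69 and the mixed value about 3.00. A word that is common within English looks rare once the Japanese documents are counted, so it is weighted more heavily than a monolingual search would want.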

> Alternatively, you can create different index definitions for each subtree (see [1]), e.g. using the “includedPaths” property. This would lead to smaller indexes, with the downside that you would have to create a new index definition whenever you add a new language tree.
> This approach has the additional benefit that you can define language-specific Lucene analyzers for each subtree, so that e.g. in the example above the Japanese index would have its own analyzer.

OK, so it’s possible to tweak this with the standard indexer in Oak without having to switch to an external indexer like Solr just for this. Good to hear.

regards,
Lukas Kahwe Smith
smith@pooteeweet.org




Re: multilingual content and indexing

Posted by Michael Marth <mm...@adobe.com>.
Hi Lukas,

I am not entirely sure what you want to achieve (or what exactly you mean by “dealing with multi language content”), but trying to answer a bit:

Let’s say you have distinct content trees for different languages, e.g.
/content/en
/content/jp
etc.

You can choose to index all these trees in one (Lucene) index for full text search and filter the results in your query, i.e. put the burden on the query engine.
This is a simple setup which leads to a large index (although I personally have not seen this to be a problem).

Alternatively, you can create different index definitions for each subtree (see [1]), e.g. using the “includedPaths” property. This would lead to smaller indexes, with the downside that you would have to create a new index definition whenever you add a new language tree.
This approach has the additional benefit that you can define language-specific Lucene analyzers for each subtree, so that e.g. in the example above the Japanese index would have its own analyzer.
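
As a rough sketch of the per-subtree setup, the following models two index definitions as plain Java maps and prints them as a node tree. It does not touch a repository. The index names (contentEn, contentJp) are hypothetical, and the property names (type, async, compatVersion, includedPaths, analyzers/default/class) are assumptions based on the Oak Lucene documentation in [1], so check them against your Oak version. A real definition would also need index rules, which are omitted here, and the analyzer classes would have to be available to Oak at runtime.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LanguageIndexDefinitions {
    // Models one Lucene index definition restricted to a single language subtree.
    // Assumption: property names follow the Oak Lucene docs; indexRules are omitted.
    static Map<String, Object> definitionFor(String contentPath, String analyzerClass) {
        Map<String, Object> defaultAnalyzer = new LinkedHashMap<>();
        defaultAnalyzer.put("class", analyzerClass);

        Map<String, Object> analyzers = new LinkedHashMap<>();
        analyzers.put("default", defaultAnalyzer);

        Map<String, Object> def = new LinkedHashMap<>();
        def.put("jcr:primaryType", "oak:QueryIndexDefinition");
        def.put("type", "lucene");
        def.put("async", "async");
        def.put("compatVersion", 2L);
        def.put("includedPaths", List.of(contentPath)); // only this subtree is indexed
        def.put("analyzers", analyzers);
        return def;
    }

    // One definition per language tree; index names are hypothetical.
    static Map<String, Map<String, Object>> allDefinitions() {
        Map<String, Map<String, Object>> defs = new LinkedHashMap<>();
        defs.put("/oak:index/contentEn",
                definitionFor("/content/en", "org.apache.lucene.analysis.en.EnglishAnalyzer"));
        defs.put("/oak:index/contentJp",
                definitionFor("/content/jp", "org.apache.lucene.analysis.ja.JapaneseAnalyzer"));
        return defs;
    }

    // Renders a node as "+ child" and "- property = value" lines.
    static void print(String name, Map<String, Object> node, String indent, StringBuilder out) {
        out.append(indent).append("+ ").append(name).append('\n');
        for (Map.Entry<String, Object> e : node.entrySet()) {
            if (e.getValue() instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) e.getValue();
                print(e.getKey(), child, indent + "  ", out);
            } else {
                out.append(indent).append("  - ").append(e.getKey())
                   .append(" = ").append(e.getValue()).append('\n');
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        allDefinitions().forEach((path, def) -> print(path, def, "", out));
        System.out.print(out);
    }
}
```

Adding a new language tree then means adding one more entry in allDefinitions(), which is the maintenance cost mentioned above.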

HTH
Michael

[1] http://jackrabbit.apache.org/oak/docs/query/lucene.html


