Posted to java-user@lucene.apache.org by Chris Tomlinson <ch...@gmail.com> on 2018/04/11 13:33:56 UTC

analyzer context during search

Hello,

I’m working on a project where it would be most helpful for getWrappedAnalyzer() in an extension to DelegatingAnalyzerWrapper to have access to more than just the fieldName.

The scenario is that we are working with several languages: Tibetan, Sanskrit, and Chinese. Each has several encodings, e.g., Simplified Chinese (zh-hans), Traditional Chinese (zh-hant), Pinyin with diacritics (zh-pinyin), and Pinyin without diacritics (zh-pinyin-ndia). Our data comes from many sources, each using a variety of encodings, and we wish to preserve the original encodings used in the data.

For Chinese, for example, we have an analyzer that creates a TokenStream of Pinyin with diacritics for any of the input encodings. Thus it is possible in some situations to retrieve documents originally input as zh-hans and so on.

The same applies to the other languages.

One objective is to allow the user to input a query in zh-pinyin, for example, and to retrieve documents that were originally indexed in any of the variant encodings.

The current scheme, in Apache Jena + Lucene, is to create a fieldName that includes the original name plus a language tag, e.g., label_zh-hans, so that the getWrappedAnalyzer() can then retrieve a registered analyzer for zh-hans that will then index using Pinyin tokens as mentioned above.

For Chinese, we end up with documents that have four different fields: label_zh-hans, label_zh-hant, label_zh-pinyin, and label_zh-pinyin-ndia. At indexing time the field name tells us which input encoding was used, so an appropriate analyzer configuration can be chosen; the analyzer has to be aware of the incoming encoding.
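As a rough illustration of that routing (a self-contained sketch with plain-Java stand-ins; the analyzer names and the registry are hypothetical, not the actual Jena code), getWrappedAnalyzer() effectively recovers the language tag from the field-name suffix and looks up a registered analyzer:

```java
import java.util.Map;

// Sketch of the field-name routing described above. The registry keys are
// the language tags; the string values stand in for configured analyzers.
public class TagRouting {
    static final Map<String, String> REGISTERED = Map.of(
        "zh-hans", "HansToPinyinAnalyzer",
        "zh-hant", "HantToPinyinAnalyzer",
        "zh-pinyin", "PinyinAnalyzer",
        "zh-pinyin-ndia", "PinyinNoDiacriticsAnalyzer");

    // "label_zh-hans" -> "zh-hans": the tag follows the first underscore.
    static String tagOf(String fieldName) {
        int i = fieldName.indexOf('_');
        return i < 0 ? "" : fieldName.substring(i + 1);
    }

    // What getWrappedAnalyzer(fieldName) effectively does in this scheme.
    static String analyzerFor(String fieldName) {
        return REGISTERED.getOrDefault(tagOf(fieldName), "DefaultAnalyzer");
    }
}
```

The point of the sketch is only that the field name is the sole input: everything downstream is a function of that suffix.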

At search time we could try a search like:

    (label_zh-hans:a-query-in-pinyin OR label_zh-hant:a-query-in-pinyin OR label_zh-pinyin:a-query-in-pinyin OR label_zh-pinyin-ndia:a-query-in-pinyin)

But this cannot work: only the original encoding is available to getWrappedAnalyzer(), as part of the field name. The information that the query itself is in zh-pinyin is not available, so there is no way to tokenize the query string correctly when querying the other fields.

I’m probably over-thinking things, but it seems to me that what I need is a way to access additional context when choosing an analyzer, so that the information that the query string is in Pinyin would be available alongside the field names as usual.

I don’t see how a custom query analyzer would help here. We would know that the call to the analyzer wrapper was for querying rather than indexing, but we still wouldn’t know the encoding of the query string, only the field name.

I imagine this sort of scenario has been solved by others numerous times, but I’m stumped as to how to implement it.

Thanks in advance for any help,
Chris


Re: analyzer context during search

Posted by Chris Tomlinson <ch...@gmail.com>.
Hi,

Thanks for the thoughts. I agree a combinatorial explosion of fields and index size would “solve” the problem, but the cost is rather absurd. Hence, I posed the problem to prompt some discussion about what a plausible/reasonable solution might be.

It has seemed to me for some time that there really should be an extension of the Analyzer API to include a generic argument of an abstract class, AnalyzerContext, that could optionally be used to supply useful context information from the caller to IndexWriter and IndexSearcher.

This would require threading the parameter throughout, much as was done versions ago with the Version argument.
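To make the proposal concrete, a hypothetical sketch of what such an API might look like; nothing like this exists in Lucene, and plain strings stand in for Analyzer instances to keep the sketch self-contained:

```java
// Hypothetical AnalyzerContext sketch; no such API exists in Lucene.
abstract class AnalyzerContext {}

// The caller (here, the search side) says what encoding the query is in.
final class QueryEncodingContext extends AnalyzerContext {
    final String queryEncoding;  // e.g. "zh-pinyin"
    QueryEncodingContext(String queryEncoding) {
        this.queryEncoding = queryEncoding;
    }
}

abstract class ContextAwareWrapper {
    // Today's hook: only the field name is available.
    protected abstract String getWrappedAnalyzer(String fieldName);

    // Proposed overload: context threaded in from IndexSearcher, so a
    // zh-pinyin query is tokenized as Pinyin whatever the field suffix.
    protected String getWrappedAnalyzer(String fieldName, AnalyzerContext ctx) {
        if (ctx instanceof QueryEncodingContext q) {
            return "analyzer:" + q.queryEncoding;
        }
        return getWrappedAnalyzer(fieldName);  // fall back to field-name routing
    }
}
```

With context present, the field suffix still selects the indexed encoding, but the query string's own encoding finally has a channel of its own.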

Another approach might be to instantiate an analyzer on each use of at least IndexSearcher, so that a custom analyzer with context information could be provided; however, the cost of frequently instantiating analyzers would likely hurt performance.
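A minimal sketch of that per-search idea (stand-in types again; a real version would extend DelegatingAnalyzerWrapper and return Analyzer instances): the wrapper is constructed per query with the known query encoding baked in, and deliberately ignores the field suffix.

```java
// Per-search sketch: the query's encoding is fixed at construction time,
// so every field resolves to the same query-side analyzer. A stand-in
// string takes the place of a real Analyzer instance.
final class PerQueryWrapper {
    private final String queryEncoding;

    PerQueryWrapper(String queryEncoding) {
        this.queryEncoding = queryEncoding;
    }

    // The field suffix is ignored at query time: the caller already
    // knows the query is, e.g., zh-pinyin.
    String analyzerFor(String fieldName) {
        return "analyzer:" + queryEncoding;
    }
}
```

The cost concern above is about constructing a fresh object like this (and the real analyzers behind it) on every search.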

LUCENE-8240 did not appear to me to point toward a solution.

Thanks,
Chris


> On Apr 12, 2018, at 5:24 AM, Michael Sokolov <ms...@gmail.com> wrote:
> 
> I think you can achieve what you are asking by having a field for every
> possible combination of pairs of input and output. Obviously this would
> explode the size of your index, so it's not ideal.
> 
> Another alternative would be indexing all variants into a single field,
> using different analyzers for different inputs. Doing this requires extra
> context when choosing the analyzer (or the token streams that it
> generates), as you say. See http://issues.apache.org/jira/browse/LUCENE-8240
> for one idea of how to accomplish this.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: analyzer context during search

Posted by Michael Sokolov <ms...@gmail.com>.
I think you can achieve what you are asking by having a field for every
possible combination of pairs of input and output. Obviously this would
explode the size of your index, so it's not ideal.

Another alternative would be indexing all variants into a single field,
using different analyzers for different inputs. Doing this requires extra
context when choosing the analyzer (or the token streams that it
generates), as you say. See http://issues.apache.org/jira/browse/LUCENE-8240
for one idea of how to accomplish this.
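One way to picture that single-field alternative (a toy sketch: the functions below stand in for real per-encoding analyzers, and the "transliterations" are fake markers, not actual Pinyin conversion): each source encoding gets its own normalization step, but every result is indexed into the same logical field, already in Pinyin-with-diacritics form.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Toy sketch of indexing all variants into one field. Each encoding has
// its own normalizer (standing in for a per-encoding Analyzer); all
// output targets the single shared field "label_zh".
public class SingleFieldSketch {
    static final String SHARED_FIELD = "label_zh";

    static final Map<String, UnaryOperator<String>> NORMALIZERS = Map.of(
        "zh-hans", s -> "pinyin(" + s + ")",            // fake transliteration
        "zh-hant", s -> "pinyin(" + s + ")",
        "zh-pinyin", s -> s,                            // already Pinyin
        "zh-pinyin-ndia", s -> "diacritics(" + s + ")");

    // Normalize one value for indexing into the shared field.
    static String normalize(String encoding, String value) {
        return NORMALIZERS.get(encoding).apply(value);
    }
}
```

A zh-pinyin query then searches the one shared field, at the price of needing the original encoding as extra context at index time, which is exactly the missing piece the thread is about.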


