Posted to java-user@lucene.apache.org by Igor Shalyminov <is...@yandex-team.ru> on 2013/09/27 16:11:43 UTC
Indexing documents with multiple field values
Hello!
I have very long document field values. The tokens of these fields have the form word|payload|position_increment (I need to control position increments and payloads manually).
I collect these compound tokens for the entire document, join them with '\t', and then pass the resulting string to my custom analyzer.
(For the really long field strings, something breaks inside UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
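A minimal sketch of how such compound tokens might be assembled before analysis (the class and method names here are illustrative, not from my actual code):

```java
import java.util.List;
import java.util.StringJoiner;

public class CompoundTokenJoiner {
    // Builds the tab-joined string of word|payload|position_increment tokens
    // that gets handed to the analyzer.
    public static String join(List<String> words, List<Integer> payloads, List<Integer> increments) {
        StringJoiner sj = new StringJoiner("\t");
        for (int i = 0; i < words.size(); i++) {
            sj.add(words.get(i) + "|" + payloads.get(i) + "|" + increments.get(i));
        }
        return sj.toString();
    }
}
```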
The analyzer is just the following:
class AmbiguousTokenAnalyzer extends Analyzer {
    private PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}
Both CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter have an incrementToken() method that processes the rightmost "|aaa" part of each token.
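The suffix handling in those filters boils down to splitting each token at its rightmost delimiter; roughly (a hypothetical standalone helper, not the actual filter code):

```java
// Hypothetical helper mirroring the filters' suffix handling:
// split a compound token at its rightmost delimiter.
final class TokenSuffix {
    // "S|3" with '|' -> { "S", "3" }; no delimiter -> { token, null }
    static String[] splitRightmost(String token, char delimiter) {
        int i = token.lastIndexOf(delimiter);
        if (i < 0) {
            return new String[] { token, null };
        }
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }
}
```

The filter then parses the suffix (payload or position increment), sets the corresponding attribute, and truncates the term to the part before the delimiter.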
The field is configured as:
attributeFieldType.setIndexed(true);
attributeFieldType.setStored(true);
attributeFieldType.setOmitNorms(true);
attributeFieldType.setTokenized(true);
attributeFieldType.setStoreTermVectorOffsets(true);
attributeFieldType.setStoreTermVectorPositions(true);
attributeFieldType.setStoreTermVectors(true);
attributeFieldType.setStoreTermVectorPayloads(true);
The problem is: if I pass the whole field to the analyzer as one huge string (via document.add(...)), it works fine, but if I pass it token by token, something breaks at the search stage.
As far as I've read, these two ways should produce the same index. Maybe my analyzer is missing something?
--
Best Regards,
Igor Shalyminov
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Indexing documents with multiple field values
Posted by Igor Shalyminov <is...@yandex-team.ru>.
Hi all!
A little more exploration :)
After indexing with multiple atomic field values, here is what I get:
indexSearcher.doc(0).getFields("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000 S|1|0>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
indexSearcher.doc(0).getField("gramm")
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
The values are absolutely correct, but why does getField() return only the first one instead of concatenating them?
If I want to handcraft my custom highlighter, is iterating through (roughly) all the stored field values the right technique? (Previously I was using Analyzer.tokenStream().incrementToken() over the entire concatenated field.)
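A sketch of the manual concatenation I have in mind (plain Java, assuming the stored values for the field have already been pulled out as strings):

```java
import java.util.List;

final class StoredFieldJoiner {
    // Multiple stored values for one field come back as separate entries;
    // this joins them back into a single space-separated string.
    static String concatenate(List<String> values) {
        return String.join(" ", values);
    }
}
```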
--
Igor
Re: Indexing documents with multiple field values
Posted by Igor Shalyminov <is...@yandex-team.ru>.
Hi again!
Here is my problem in more detail: in addition to indexing, I need the multi-valued field to be stored as-is. If I pass it to the analyzer as multiple atomic tokens, only the first one gets stored.
What do I need to change in my custom analyzer so that all the atomic tokens eventually end up stored, concatenated?
--
Igor