Posted to dev@lucene.apache.org by Shyamsunder Mutcha <sj...@gmail.com> on 2019/10/30 23:16:24 UTC

Consuming token stream more than once in same filter

I have a requirement to handle synonyms differently based on the first word
(token) in the text field of the document. I have implemented a custom
SynFilterFactory that loads synonyms per language when the Solr core is
started.

Now, in the MySynonymFilterFactory#create(TokenStream input) method, I have
to read the first token from the input TokenStream. Based on that token
value, the corresponding SynonymMap will be used to create the SynonymFilter.

Here are my documents:
doc1 <text>lang_eng this is English language text</text>
doc2 <text>lang_fra this is French language text</text>
doc3 <text>lang_spa this is Spanish language text</text>

MySynonymFilterFactory creates MySynonymFilter. The create() method's logic
is below:

@Override
public TokenStream create(TokenStream input) {

    // if the fst is null, it means there's actually no synonyms... just return
    // the original stream as there is nothing to do here.
    // return map.fst == null ? input : new MySynonymFilter(input, map, ignoreCase);

    System.out.println("input=" + input);

    // somehow read the TokenStream here to capture the lang value
    SynonymMap synonyms = null;
    try {
        CharTermAttribute termAtt = input.addAttribute(CharTermAttribute.class);
        boolean first = false;
        input.reset();
        while (!first && input.incrementToken()) {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            System.out.println("termAtt=" + term);
            if (StringUtils.startsWith(term, "lang_")) {
                String[] split = StringUtils.split(term, "_");
                String lang = split[1];
                String key = langSynMap.containsKey(lang) ? lang : "generic";
                synonyms = langSynMap.get(key);
                System.out.println("synonyms=" + synonyms);
            }
            first = true;
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return synonyms == null ? input : new SynonymFilter(input, synonyms, ignoreCase);
}

This code compiles, and the new analysis works fine on the Solr admin
analysis screen. But the same fails with the exception below when I try to
index a document:
30273 ERROR (qtp1689843956-18) [   x:gcom] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id id1 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
        at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
Caused by: java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
        at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
        at com.synonyms.poc.synpoc.MySynonymFilterFactory.create(MySynonymFilterFactory.java:94)
        at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:91)
        at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
        at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
        at org.apache.lucene.document.Field.tokenStream(Field.java:562)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
        at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
        ... 37 more

Any idea how I can read a token stream without violating the token stream
contract? I found a similar discussion here,
https://lucene.472066.n3.nabble.com/how-to-reuse-a-tokenStream-td850767.html,
but it doesn't solve my problem.
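For context, the contract the exception refers to is the fixed lifecycle reset() -> incrementToken()* -> end() -> close(). The following is a simplified stand-in (illustrative names, not Lucene's real classes) that models the failure mode: the factory consumes the stream after calling reset(), and the indexing chain then calls reset() again on the same underlying tokenizer.

```java
import java.util.Iterator;
import java.util.List;

// Simplified model of the TokenStream lifecycle (illustrative, not Lucene code).
// Real tokenizers enforce reset() -> incrementToken()* -> end() -> close();
// a second reset() without an intervening close() violates the contract.
class MockTokenStream {
    private final List<String> tokens;
    private Iterator<String> it;
    private boolean open = false;
    String term; // stands in for CharTermAttribute

    MockTokenStream(List<String> tokens) { this.tokens = tokens; }

    void reset() {
        if (open) {
            // roughly what the indexing chain triggers after create() has
            // already consumed the stream
            throw new IllegalStateException(
                "TokenStream contract violation: reset() called multiple times");
        }
        open = true;
        it = tokens.iterator();
    }

    boolean incrementToken() {
        if (!open) throw new IllegalStateException("reset() call missing");
        if (!it.hasNext()) return false;
        term = it.next();
        return true;
    }

    void close() { open = false; }
}
```

In this model, peeking at the first token inside create() leaves the stream open, so the second reset() from the indexing chain throws. The admin analysis screen hands the factory a different stream implementation (ListBasedTokenStream), which may be why the same misuse is not flagged there.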

Also, why is the same error not reported when analyzing the field value
using the Solr admin console analysis screen?

Thanks

Re: Consuming token stream more than once in same filter

Posted by Shyamsunder Mutcha <sj...@gmail.com>.
Another way to understand what I am trying to solve here: consider two documents like this:

doc1:
<doc>
 <lang>eng</lang>
 <text>English text</text>
</doc>

doc2:
<doc>
 <lang>fra</lang>
 <text>French text</text>
</doc>

So at analysis time for the text field, is there a way for me to know the lang field value? I see that the Analyzer/Tokenizer has no reference to other fields' values. So what I am doing in the Update Request Processor is adding the lang value as the first word. Here is what my documents look like after the URP:

doc1:
<doc>
 <lang>eng</lang>
 <text>lang_eng English text</text>
</doc>

doc2:
<doc>
 <lang>fra</lang>
 <text>lang_fra French text</text>
</doc>

So in my downstream tokenizer/token filter classes, I plan to read the first token; if it starts with "lang_", I read the language value (eng/fra) and apply the corresponding token filter. I also skip this first token when it starts with "lang_", so it is not added to the index.
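The marker-parsing step described above can be sketched as a small helper (hypothetical names, shown only to make the intent concrete): extract the language from the "lang_" marker and fall back to a generic synonym map when the language is unknown.

```java
import java.util.Map;

// Hypothetical helper sketching the lang-marker logic described above.
class LangKeyResolver {
    static final String PREFIX = "lang_";

    // Returns the synonym-map key for the first token, or null when the
    // token is not a language marker and normal analysis should proceed.
    static String resolveKey(String firstToken, Map<String, ?> langSynMap) {
        if (firstToken == null || !firstToken.startsWith(PREFIX)) {
            return null;
        }
        String lang = firstToken.substring(PREFIX.length());
        return langSynMap.containsKey(lang) ? lang : "generic";
    }
}
```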

It sounds like a crazy problem because I am trying to add documents of all languages to one single core. If I created a core per language, this problem would be solved out of the box.

Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Consuming token stream more than once in same filter

Posted by Shyamsunder Mutcha <sj...@gmail.com>.
Thanks for your inputs, Michael. It looks like ConditionalTokenFilter was introduced in Lucene 7.4. I have implemented a similar approach where I read the first token from the input stream in MySynonymFilterFactory and load the language-specific SynonymMap to create MySynonymFilter. This approach seems to work when I test the field analysis using the Solr admin console analysis page, but it reports an error when I index a document.

When I print the TokenStream input parameter, I see it has two different class types:

*Admin console analysis*
input=ListBasedTokenStream@3a90e594 term=,bytes=[],startOffset=13,endOffset=13,positionIncrement=0,positionLength=1,type=word,position=3,positionHistory=[Ljava.lang.Integer;@34571cb6

*Solr indexing*
input=StandardTokenizer@41448310 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word

I have added input.reset()/close()/end() calls in the constructor of MySynonymFilterFactory, but that didn't work. So I need a way to consume a token stream twice without breaking the contract:
- consume it once to read the first token
- then reset it to its original state
- consume it in the SynonymFilter, where synonyms are applied according to the first token, which is saved as a variable in that SynonymFilter
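One established way to get this consume-twice behaviour in Lucene is a caching filter (Lucene ships CachingTokenFilter for this purpose): tokens are buffered on the first pass and replayed later, so the underlying tokenizer is never reset() a second time. A toy model of the idea, with made-up names:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of the caching idea behind Lucene's CachingTokenFilter
// (illustrative names, not the real API). Tokens from the single-use
// source are buffered so they can be replayed without a second reset().
class CachingStream {
    private final Iterator<String> source; // single-use underlying stream
    private final List<String> cache = new ArrayList<>();
    private Iterator<String> replay;

    CachingStream(Iterator<String> source) { this.source = source; }

    // Returns the next token, or null when exhausted.
    String next() {
        if (replay != null) {
            return replay.hasNext() ? replay.next() : null;
        }
        if (source.hasNext()) {
            String t = source.next();
            cache.add(t);
            return t;
        }
        return null;
    }

    // Rewind to the first token by draining the source into the cache
    // and replaying it; the source itself is never reset.
    void rewind() {
        while (source.hasNext()) {
            cache.add(source.next());
        }
        replay = cache.iterator();
    }
}
```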

Thanks

On 2019/10/31 08:54:02, Michael Sokolov <ms...@gmail.com> wrote: 
> Are you able to:
> 1) create a custom attribute encoding the language
> 2) create a filter that sets the attribute when it reads the first token
> 3) wrap your synonym filters (one for each language) in a
> ConditionalTokenFilter that filters based on the language attribute
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Consuming token stream more than once in same filter

Posted by Michael Sokolov <ms...@gmail.com>.
Are you able to:
1) create a custom attribute encoding the language
2) create a filter that sets the attribute when it reads the first token
3) wrap your synonym filters (one for each language) in a
ConditionalTokenFilter that filters based on the language attribute
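The three steps above can be modeled in miniature (illustrative code, not Lucene's actual attribute or ConditionalTokenFilter API): the first token sets a shared language attribute and is dropped, and each per-language synonym stage only fires when the attribute matches its language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Miniature model of the suggestion (not Lucene's real API): a shared
// language attribute is set from the first token, and each per-language
// synonym stage is applied conditionally based on that attribute.
class ToyConditionalSynonyms {
    static List<String> analyze(List<String> tokens,
                                Map<String, Map<String, String>> synonymsByLang) {
        String lang = null; // the "custom attribute"
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (lang == null && t.startsWith("lang_")) {
                lang = t.substring("lang_".length()); // set attribute, drop marker
                continue;
            }
            // the "conditional filter": only the matching language's
            // synonym map is consulted
            Map<String, String> syns = synonymsByLang.get(lang);
            out.add(syns != null ? syns.getOrDefault(t, t) : t);
        }
        return out;
    }
}
```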


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org