Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2012/02/10 00:30:45 UTC

custom TokenFilter

I need to take user input and index it in a unique fashion:
essentially the value is some string (say "abcdefghijk") and needs to
be converted into a set of tokens (say 1 2 3 4).  I have currently
implemented a custom TokenFilter to do this; is this appropriate?  In
cases where I am indexing things slowly (i.e. 1 at a time) this works
fine, but when I send 10,000 things to Solr (all in one thread) I am
noticing exceptions where it seems that the generated instance
variable is being used by several threads.  Is my implementation
appropriate, or is there another more appropriate way to do this?  Are
TokenFilters reused?  Would it be more appropriate to convert the
stream to 1 space-separated token and then run that through a
WhitespaceTokenizer?  Any guidance on this would be greatly
appreciated.

	class CustomFilter extends TokenFilter {
		// needs: java.util.ArrayList, java.util.List
		private final CharTermAttribute termAtt =
				addAttribute(CharTermAttribute.class);

		// Per-document buffer of tokens expanded from the current input
		// token; incrementToken() drains it one token per call.
		private List<AttributeSource> generated;

		protected CustomFilter(TokenStream input) {
			super(input);
		}

		@Override
		public boolean incrementToken() throws IOException {
			while (true) {
				if (generated == null) {
					// pull the next input token and expand it
					if (!input.incrementToken()) {
						return false;
					}
					List<String> cells = StaticClass.generateTokens(termAtt.toString());
					generated = new ArrayList<AttributeSource>(cells.size());
					boolean first = true;
					for (String cell : cells) {
						AttributeSource newTokenSource = this.cloneAttributes();
						CharTermAttribute newTermAtt =
								newTokenSource.addAttribute(CharTermAttribute.class);
						newTermAtt.setEmpty();
						newTermAtt.append(cell);
						OffsetAttribute newOffsetAtt =
								newTokenSource.addAttribute(OffsetAttribute.class);
						PositionIncrementAttribute newPosIncAtt =
								newTokenSource.addAttribute(PositionIncrementAttribute.class);
						newOffsetAtt.setOffset(0, 0);
						// stack every generated token at the same position
						newPosIncAtt.setPositionIncrement(first ? 1 : 0);
						generated.add(newTokenSource);
						first = false;
					}
				}
				if (!generated.isEmpty()) {
					copy(this, generated.remove(0));
					return true;
				}
				// current expansion is exhausted; move on to the next input token
				generated = null;
			}
		}

		private void copy(AttributeSource target, AttributeSource source) {
			if (target != source)
				source.copyTo(target);
		}

		@Override
		public void reset() throws IOException {
			super.reset();
			generated = null;
		}
	}
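For context on the reuse question: Lucene and Solr reuse a single TokenStream instance per thread and call reset() between documents, so per-instance state is only safe if the instance itself is never shared across threads, and reset() must clear anything left over from the previous document. A minimal plain-Java model of that drain-and-reset pattern (hypothetical names, not the Lucene API; the `split("")` call stands in for `StaticClass.generateTokens`):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Plain-Java model of a reusable expanding stream: per-instance state is
// fine as long as reset() clears it before the instance is reused.
class ExpandingStream {
    private Deque<String> pending;   // per-document state, like "generated"
    private final List<String> input;
    private int pos;

    ExpandingStream(List<String> input) {
        this.input = input;
    }

    // Analogue of incrementToken(): emit buffered expansions first,
    // otherwise pull the next input token and expand it.
    String next() {
        if (pending != null && !pending.isEmpty()) {
            return pending.poll();
        }
        if (pos >= input.size()) {
            return null;
        }
        String tok = input.get(pos++);
        // stand-in for StaticClass.generateTokens: one token per character
        pending = new ArrayDeque<>(Arrays.asList(tok.split("")));
        return pending.poll();
    }

    // Analogue of reset(): MUST clear per-document state, or the next
    // document reusing this instance sees stale tokens.
    void reset() {
        pending = null;
        pos = 0;
    }
}
```

If reset() forgot to null out `pending`, the second document run through the same instance would start by replaying the first document's leftovers, which is exactly the kind of symptom that looks like cross-thread interference.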

Re: custom TokenFilter

Posted by Jamie Johnson <je...@gmail.com>.
Think I figured it out, the tokens just needed the same position attribute.
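For anyone hitting the same thing: an absolute token position is the running sum of the position increments, so giving every generated token after the first an increment of 0 stacks them all at one position (the same trick synonym filters use), which is what lets the query side treat them as alternatives instead of a phrase. A small sketch of that arithmetic (hypothetical helper, not a Lucene class):

```java
import java.util.ArrayList;
import java.util.List;

// Absolute token positions are the running sum of position increments:
// an increment of 0 keeps the token at the previous token's position.
class Positions {
    static List<Integer> absolutePositions(List<Integer> increments) {
        List<Integer> positions = new ArrayList<>();
        int pos = 0;
        for (int inc : increments) {
            pos += inc;          // increment 0 stacks on the previous position
            positions.add(pos);
        }
        return positions;
    }
}
```

So increments [1, 0, 0, 0] put all four tokens at position 1 (stacked, synonym-like), while [1, 1, 1, 1] spread them over positions 1–4 (a phrase).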


Re: custom TokenFilter

Posted by Jamie Johnson <je...@gmail.com>.
Thanks Robert, that worked perfectly for the index side of the house.
Now on the query side I have a similar Tokenizer, but it's not
operating quite the way I want it to.  The query tokenizer generates
the tokens properly, except I'm ending up with a phrase query, i.e.
field:"1 2 3 4", when I really want field:1 OR field:2 OR field:3 OR
field:4.  Is there something in the tokenizer that needs to be set to
generate this type of query, or is it something in the query parser?


Re: custom TokenFilter

Posted by Robert Muir <rc...@gmail.com>.

Well, the easiest option is if you can build what you need out of
existing resources...

But if you need to write your own, and if your input is not massive
documents / you have no problem processing the whole field in RAM at
once, you could try looking at PatternTokenizer for an example:

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTokenizer.java

-- 
lucidimagination.com
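The PatternTokenizer approach Robert points at boils down to reading the whole field into one String and walking it with a regex Matcher, one token per match. A minimal plain-Java sketch of that loop, outside the Lucene API (a real Tokenizer would instead fill the term and offset attributes inside incrementToken()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Whole-field regex tokenization: compile the token pattern once,
// then emit one token per match over the buffered field value.
class PatternSplit {
    static List<String> tokenize(String field, String tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenPattern).matcher(field);
        while (m.find()) {
            // m.start()/m.end() are what would feed an OffsetAttribute
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, tokenize("1 2 3 4", "\\S+") yields the four tokens 1, 2, 3, 4. The whole-field buffering is why Robert's caveat about massive documents applies: the entire value has to fit in RAM before any token is emitted.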

Re: custom TokenFilter

Posted by Jamie Johnson <je...@gmail.com>.
Again, thanks.  I'll take a stab at that.  Are you aware of any
resources/examples of how to do this?  I figured I'd start with
WhitespaceTokenizer but wasn't sure if there was a simpler place to
start.


Re: custom TokenFilter

Posted by Robert Muir <rc...@gmail.com>.

I can't say for sure, to be honest, because it's a bit too
abstract... I don't know the reasoning behind trying to convert
"abcdefghijk" to 1 2 3 4, and I'm not sure I really understand what
that means either.

But in general: if you are taking the whole content of a field and
making it into tokens, then it's best implemented as a Tokenizer.

-- 
lucidimagination.com

Re: custom TokenFilter

Posted by Jamie Johnson <je...@gmail.com>.
Thanks Robert, I'll take a look there.  Does it sound like I'm on the
right track with what I'm implementing?  In other words, is a
TokenFilter appropriate, or is there something else that would be a
better fit for what I've described?


Re: custom TokenFilter

Posted by Robert Muir <rc...@gmail.com>.
If you are writing a custom TokenStream, I recommend using some of the
resources in Lucene's test-framework.jar to test it.
These find lots of bugs (including thread-safety bugs)!

For a filter, I recommend using the assertions in
BaseTokenStreamTestCase: assertTokenStreamContents, assertAnalyzesTo,
and especially checkRandomData:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java

When testing your filter, for even more checks, don't use
WhitespaceTokenizer or KeywordTokenizer; use MockTokenizer, which has
more checks:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/MockTokenizer.java

For some examples, you can look at the tests in modules/analysis.

And of course enable assertions (-ea) when testing!
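The spirit of checkRandomData, stripped down to plain Java (the `tokenize` method here is a hypothetical stand-in for your real analysis step, not anything from Lucene): hammer the same instance with many random inputs and check the output is deterministic across repeated runs. Stateful-reuse bugs like a stale buffered-token list tend to show up quickly under this kind of fuzzing:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Seeded random-input smoke test: run every random document through the
// analysis step twice and check both runs agree.
class RandomDataCheck {
    static List<String> tokenize(String s) {
        // stand-in for the real analysis chain: one token per character
        return Arrays.asList(s.split("(?<=.)"));
    }

    static boolean deterministic(long seed, int iterations) {
        Random r = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            // build a random lowercase "document" of 1..20 chars
            StringBuilder sb = new StringBuilder();
            int len = 1 + r.nextInt(20);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + r.nextInt(26)));
            }
            String doc = sb.toString();
            // any hidden leftover state makes the two runs differ
            if (!tokenize(doc).equals(tokenize(doc))) {
                return false;
            }
        }
        return true;
    }
}
```

The real checkRandomData does far more (random analysis chains, offsets/positions validation, multi-threaded runs), but the seeded-random, run-twice-and-compare idea is the core of it.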


-- 
lucidimagination.com