Posted to java-user@lucene.apache.org by ta...@controldocs.com on 2007/12/20 18:43:02 UTC

Changing the Punctuation definition for StandardAnalyzer

I am using StandardAnalyzer for my indexes. I don't need to be able to
search for whole email addresses, and want '@' to be treated as punctuation
too. My users would rather search by user id and/or host name and get back
all matching email addresses than search by the whole address. That way,
they can still build a query that returns a specific email address
anyway.

How do I make StandardAnalyzer treat '@' as punctuation?

Thanks
Tareque

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.

I hadn't actually implemented the TokenFilter solution before deciding not
to go with it, so I don't have any benchmarks.

In the end I took care of the problem with a variation of your quick and
dirty solution: I intercept the '@' character in FastCharStream.java and
replace it with a blank space. That took care of it.
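
A minimal sketch of the same idea, for anyone who would rather not patch
FastCharStream.java: the replacement can be done in a java.io.FilterReader
placed in front of the tokenizer. This is not the actual change described
above; the class name AtSignToSpaceReader and the helper replaceAll are
hypothetical, and wiring it into an analyzer is left out.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: replaces every '@' read from the wrapped Reader with a space. */
class AtSignToSpaceReader extends FilterReader {

    AtSignToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '@' ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '@') {
                buf[i] = ' ';
            }
        }
        return n;
    }

    /** Demonstration only: pushes a string through the reader and returns the result. */
    static String replaceAll(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new AtSignToSpaceReader(new StringReader(text))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "tareque controldocs.com"
        System.out.println(replaceAll("tareque@controldocs.com"));
    }
}
```

Because one character is swapped for exactly one other character, the
character offsets of everything after the '@' stay the same. Such a reader
could presumably wrap the Reader handed to Analyzer#tokenStream(String,
Reader), though that wiring is untested here.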

Thanks for your help!
Tareque


Re: Changing the Punctuation definition for StandardAnalyzer

Posted by Karl Wettin <ka...@gmail.com>.

On 20 Dec 2007, at 22:32, tareque@controldocs.com wrote:

> In fact I had previously located the grammar in StandardTokenizer.jj
> (I just wasn't sure if that was the one you were talking about) and had
> commented out the EMAIL entries in all of the following files:
>
> StandardTokenizer.java
> StandardTokenizer.jj
> StandardTokenizerConstants.java
>
> Now what is puzzling to me is that though I don't see the '@'

I think you'll find the JavaCC list a much better forum for this
question. You do, however, seem to have missed that StandardTokenizer and
StandardTokenizerConstants are generated artifacts, produced by the Ant
build from StandardTokenizer.jj.

Why was the TokenFilter solution not good enough? What were the results
of your benchmarks?


-- 
karl


Re: Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.

Karl,

I should have mentioned earlier that I am on Lucene 1.9.1.

In fact I had previously located the grammar in StandardTokenizer.jj (I just
wasn't sure if that was the one you were talking about) and had commented
out the EMAIL entries in all of the following files:

StandardTokenizer.java
StandardTokenizer.jj
StandardTokenizerConstants.java

Evidently the tokenizer then expected the email addresses to match one of
the other TOKEN types, and since they matched none of them, it threw a
ParseException.

What puzzles me now is this: I don't see the '@' sign (Unicode value 0040)
included in "LETTER" or any other definition, so why is it not splitting
the words? It certainly isn't, which is why the tokenizer expects the email
address to be defined as a TOKEN type. My understanding, from reading the
code, is that any character not defined in the grammar would act as a
splitter, since it does not contribute to any TOKEN definition.

Please let me know what I am missing.

Thanks
Tareque


Re: Changing the Punctuation definition for StandardAnalyzer

Posted by Karl Wettin <ka...@gmail.com>.

On 20 Dec 2007, at 20:21, tareque@controldocs.com wrote:

> I would rather modify the lexer grammar, but where exactly is it
> defined? After a quick look, it seems that
> StandardTokenizerTokenManager.java may be where it is done.

http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

The tokenizer can be generated from it with the Ant build.

-- 
karl


Re: Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.

Thanks Karl,

I would rather modify the lexer grammar, but where exactly is it defined?
After a quick look, it seems that StandardTokenizerTokenManager.java may be
where it is done. Since the ampersand has a decimal value of 38, I assumed
that the following step is taken when an ampersand is encountered:

=============
            case 73:
               if (curChar == 38)
                  jjstateSet[jjnewStateCnt++] = 74;
               break;
=============

It's rather complicated, so before I delve into it I thought I should ask
whether I am looking in the right place.
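
For reference when reading the generated state machine: decimal 38 is '&',
while '@' is decimal 64 (hex 0040), so the branch quoted above would be
testing for '&' rather than '@'. A small check, with a hypothetical class
name CharCodes:

```java
import java.util.Locale;

/** Hypothetical helper for looking up the numeric codes of characters. */
class CharCodes {

    /** Formats a character together with its decimal and hex code. */
    static String describe(char c) {
        return String.format(Locale.ROOT, "'%c' = decimal %d, hex %04X", c, (int) c, (int) c);
    }

    public static void main(String[] args) {
        System.out.println(describe('&')); // prints "'&' = decimal 38, hex 0026"
        System.out.println(describe('@')); // prints "'@' = decimal 64, hex 0040"
    }
}
```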

Thanks again!
Tareque




Re: Changing the Punctuation definition for StandardAnalyzer

Posted by Karl Wettin <ka...@gmail.com>.

On 20 Dec 2007, at 18:43, tareque@controldocs.com wrote:

> I am using StandardAnalyzer for my indexes. I don't need to be able to
> search for whole email addresses, and want '@' to be treated as
> punctuation too. My users would rather search by user id and/or host
> name and get back all matching email addresses than search by the whole
> address. That way, they can still build a query that returns a specific
> email address anyway.
>
> How do I make StandardAnalyzer treat '@' as punctuation?

A quick and dirty solution is to introduce a TokenFilter that splits
any token at '@' and add it to the end of the chain of streams in
StandardAnalyzer#tokenStream.
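
The splitting step such a filter would perform can be sketched in plain
Java. This is not a Lucene TokenFilter: AtSignSplitter and split are
hypothetical names, tokens are modeled as plain strings, and token
positions and offsets are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the splitting step a custom TokenFilter would perform. */
class AtSignSplitter {

    /**
     * Splits each token at '@' and drops empty pieces.
     * Tokens without '@' pass through unchanged.
     */
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int start = 0;
            // Walk one position past the end so the last piece is flushed too.
            for (int i = 0; i <= token.length(); i++) {
                boolean atEnd = (i == token.length());
                if (atEnd || token.charAt(i) == '@') {
                    if (i > start) {
                        out.add(token.substring(start, i));
                    }
                    start = i + 1;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("mail", "tareque@controldocs.com", "today");
        // prints "[mail, tareque, controldocs.com, today]"
        System.out.println(split(tokens));
    }
}
```

A real filter would also have to emit the pieces one at a time and adjust
each piece's start and end offsets, which this sketch leaves out.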

It would probably be much more efficient if you modified the lexer
grammar that StandardTokenizer is generated from.

-- 
karl
