Posted to java-user@lucene.apache.org by John Byrne <jo...@propylon.com> on 2007/10/01 15:33:31 UTC

Indexing punctuation and symbols

Hi,

Has anyone written an analyzer that preserves punctuation and symbols 
("£", "$", "%" etc.) as tokens?

That way we could distinguish between searching for "100" and "100%" or 
"$100".

Does anyone know of a reason why that wouldn't work? I notice that even 
Google doesn't seem to support it, but I can't think why.

Regards,
John B.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing punctuation and symbols

Posted by Erick Erickson <er...@gmail.com>.
You might be able to create an analyzer that breaks your
stream up (from the example) into the tokens
"foo" and ",", and then (using the same analyzer)
search on phrases with a slop of 0. That seems like
it'd do what you want.
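The kind of splitting such an analyzer would do can be sketched in plain Java (no Lucene dependency; the class and method names here are hypothetical, and a real implementation would live inside a Tokenizer):

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationTokenizerSketch {

    // Split text into word tokens and single-character punctuation/symbol
    // tokens, discarding whitespace, so "foo," yields ["foo", ","].
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                word.append(c);                    // still inside a word token
            } else {
                if (word.length() > 0) {           // flush the pending word
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c)); // punctuation as its own token
                }
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo, $100 and 100%"));
        // [foo, ,, $, 100, and, 100, %]
    }
}
```

With tokens emitted like this, "100", "100%" and "$100" become distinguishable phrase queries over adjacent positions.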

Best
Erick

On 10/1/07, Patrick Turcotte <pa...@gmail.com> wrote:
>
> Of course, it depends on the kind of query you are doing, but (I did
> find the query parser in the mean time)
>
> MultiFieldQueryParser mfqp = new MultiFieldQueryParser(useFields, analyzer, boosts);
> where analyzer can be a PerFieldAnalyzer, followed by
> Query query = mfqp.parse(queryString);
> would do the trick quite simply.
>
> Patrick
>
> On 10/1/07, John Byrne <jo...@propylon.com> wrote:
> > Well, the size wouldn't be a problem, we could afford the extra field.
> > But it would seem to complicate the search quite a lot. I'd have to run
> > the search terms through both analyzers. It would be much simpler if the
> > characters were indexed as separate tokens.
> >
>

Re: Indexing punctuation and symbols

Posted by Patrick Turcotte <pa...@gmail.com>.
Of course, it depends on the kind of query you are doing, but (I did
find the query parser in the mean time)

MultiFieldQueryParser mfqp = new MultiFieldQueryParser(useFields, analyzer, boosts);
where analyzer can be a PerFieldAnalyzer, followed by
Query query = mfqp.parse(queryString);
would do the trick quite simply.

Patrick

On 10/1/07, John Byrne <jo...@propylon.com> wrote:
> Well, the size wouldn't be a problem, we could afford the extra field.
> But it would seem to complicate the search quite a lot. I'd have to run
> the search terms through both analyzers. It would be much simpler if the
> characters were indexed as separate tokens.
>



Re: Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.
I actually hadn't implemented the TokenFilter solution before deciding not
to go with it, so I didn't have any benchmarks.

But eventually I took care of this problem with a different
variation of your quick and dirty solution: I captured the character
'@' in FastCharStream.java and replaced it with a blank space. That took
care of it.
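The same substitution can be sketched without patching Lucene's generated sources, by wrapping the input Reader so '@' becomes a space before the tokenizer ever sees it (a sketch using only java.io; the class name is hypothetical):

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// A Reader wrapper that replaces every '@' with a space, so a downstream
// tokenizer splits "john@example.com" into two separate tokens.
public class AtToSpaceReader extends FilterReader {

    public AtToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '@' ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '@') buf[i] = ' ';
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        Reader r = new AtToSpaceReader(new StringReader("john@example.com"));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        System.out.println(sb); // john example.com
    }
}
```

The wrapped Reader would then be handed to the analyzer in place of the raw one.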

Thanks for your help!
Tareque

> On 20 Dec 2007 at 22.32, tareque@controldocs.com wrote:
>
>> In fact I had previously located the grammar in StandardTokenizer.jj
>> (just wasn't sure if that was the one you were talking about) and had
>> commented out the EMAIL entries from all the following files:
>>
>> StandardTokenizer.java
>> StandardTokenizer.jj
>> StandardTokenizerConstants.java
>>
>> Now what is puzzling to me is that though I don't see the '@'
>
> I think you'll find the JavaCC list a much better forum for this
> question. You do, however, seem a bit confused about the fact that
> StandardTokenizer and StandardTokenizerConstants are the generated
> artifacts of the Ant build, based on StandardTokenizer.jj.
>
> Why was the TokenFilter solution not good enough? What were the results
> from your benchmarks?
>
>
> --
> karl
>




Re: Changing the Punctuation definition for StandardAnalyzer

Posted by Karl Wettin <ka...@gmail.com>.
On 20 Dec 2007 at 22.32, tareque@controldocs.com wrote:

> In fact I had previously located the grammar in StandardTokenizer.jj
> (just wasn't sure if that was the one you were talking about) and had
> commented out the EMAIL entries from all the following files:
>
> StandardTokenizer.java
> StandardTokenizer.jj
> StandardTokenizerConstants.java
>
> Now what is puzzling to me is that though I don't see the '@'

I think you'll find the JavaCC list a much better forum for this
question. You do, however, seem a bit confused about the fact that
StandardTokenizer and StandardTokenizerConstants are the generated
artifacts of the Ant build, based on StandardTokenizer.jj.

Why was the TokenFilter solution not good enough? What were the results
from your benchmarks?


-- 
karl



Re: Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.
Karl,

I should have mentioned before, I have Lucene 1.9.1.

In fact I had previously located the grammar in StandardTokenizer.jj (just
wasn't sure if that was the one you were talking about) and had commented
out the EMAIL entries from all the following files:

StandardTokenizer.java
StandardTokenizer.jj
StandardTokenizerConstants.java

But evidently the tokenizer was expecting the email addresses to be one of
the other TOKEN types. Since they matched none of them, it
was throwing a ParseException.

Now what is puzzling to me is that, though I don't see the '@' (unicode
value 0040) sign included in "LETTER" or any other definition, why
is it not splitting the words? It certainly isn't, which is why the
tokenizer expects the email address to be defined as a TYPE. My
understanding, looking at the code, is that any character not defined
in the grammar would act as a splitter, since it contributes to no
TOKEN definition.

Please let me know what I am missing.

Thanks
Tareque

>
> On 20 Dec 2007 at 20.21, tareque@controldocs.com wrote:
>
>> I would rather modify the lexer grammar. But where exactly is it
>> defined? After a quick look, it seems
>> StandardTokenizerTokenManager.java may be where it is being done.
>
> http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> It can be generated with the Ant build.
>
> --
> karl
>




Re: Changing the Punctuation definition for StandardAnalyzer

Posted by Karl Wettin <ka...@gmail.com>.
On 20 Dec 2007 at 20.21, tareque@controldocs.com wrote:

> I would rather modify the lexer grammar. But where exactly is it
> defined? After a quick look, it seems
> StandardTokenizerTokenManager.java may be where it is being done.

http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

It can be generated with the Ant build.

-- 
karl



Re: Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.
Thanks Karl,

I would rather modify the lexer grammar. But where exactly is it
defined? After a quick look, it seems
StandardTokenizerTokenManager.java may be where it is being done.
Since ampersand has a decimal value of 38, I was assuming that the
following step is taken when an ampersand is encountered:

=============
              case 73:
                  if (curChar == 38)
                     jjstateSet[jjnewStateCnt++] = 74;
                  break;
=============

It's kind of complicated, so before I attempt to delve into it, I thought I
should ask if I am looking at the right place.

Thanks again!
Tareque



>
> On 20 Dec 2007 at 18.43, tareque@controldocs.com wrote:
>
>> I am using StandardAnalyzer for my indexes. Now I don't want whole
>> email addresses to be searchable; I want '@' to be considered
>> punctuation too, because my users would rather search by user id
>> and/or host name to return all the email addresses than search by
>> the whole address. That way, they can still create a query that
>> will return email addresses anyway.
>>
>> How do I let StandardAnalyzer treat '@' as punctuation?
>
> A quick and dirty solution is to introduce a TokenFilter that splits
> any token at @ and add it to the end of the chain of streams in
> StandardAnalyzer#tokenStream.
>
> It would probably be much more efficient if you modified the lexer
> grammar StandardTokenizer is generated from.
>
> --
> karl
>




Re: Changing the Punctuation definition for StandardAnalyzer

Posted by Karl Wettin <ka...@gmail.com>.
On 20 Dec 2007 at 18.43, tareque@controldocs.com wrote:

> I am using StandardAnalyzer for my indexes. Now I don't want whole
> email addresses to be searchable; I want '@' to be considered
> punctuation too, because my users would rather search by user id
> and/or host name to return all the email addresses than search by
> the whole address. That way, they can still create a query that
> will return email addresses anyway.
>
> How do I let StandardAnalyzer treat '@' as punctuation?

A quick and dirty solution is to introduce a TokenFilter that splits  
any token at @ and add it to the end of the chain of streams in  
StandardAnalyzer#tokenStream.

It would probably be much more efficient if you modified the lexer
grammar StandardTokenizer is generated from.
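The splitting step such a TokenFilter would perform can be sketched in plain Java (only the split logic is shown; the surrounding Lucene TokenFilter plumbing is omitted, and the class name is hypothetical):

```java
import java.util.Arrays;
import java.util.List;

public class SplitAtSign {

    // What the filter would do to each incoming token: if it contains
    // '@', emit one token per segment; otherwise pass it through unchanged.
    static List<String> split(String token) {
        return Arrays.asList(token.split("@"));
    }

    public static void main(String[] args) {
        System.out.println(split("john@example.com")); // [john, example.com]
        System.out.println(split("hello"));            // [hello]
    }
}
```

A real TokenFilter would buffer the segments and emit them one per call to the stream, with adjusted offsets.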

-- 
karl



Changing the Punctuation definition for StandardAnalyzer

Posted by ta...@controldocs.com.
I am using StandardAnalyzer for my indexes. Now I don't want whole email
addresses to be searchable; I want '@' to be considered punctuation too,
because my users would rather search by user id and/or host name to
return all the email addresses than search by the whole address. That
way, they can still create a query that will return email addresses
anyway.

How do I let StandardAnalyzer treat '@' as punctuation?

Thanks
Tareque



Re: Indexing punctuation and symbols

Posted by John Byrne <jo...@propylon.com>.
Well, the size wouldn't be a problem, we could afford the extra field. 
But it would seem to complicate the search quite a lot. I'd have to run 
the search terms through both analyzers. It would be much simpler if the 
characters were indexed as separate tokens.

Patrick Turcotte wrote:
> Hi,
>
> I don't know the size of your dataset, but couldn't you index into 2
> fields with PerFieldAnalyzer, tokenizing with StandardAnalyzer for one
> field and WhitespaceAnalyzer for the other?
>
> Then use a multi-field query (there is a query parser for that, I just
> don't remember the name right now).
>
> Patrick
>
> On 10/1/07, John Byrne <jo...@propylon.com> wrote:
>   
>> Whitespace analyzer does preserve those symbols, but not as tokens. It
>> simply leaves them attached to the original term.
>>
>> As an example of what I'm talking about, consider a document that
>> contains (without the quotes) "foo, ".
>>
>> Now, using WhitespaceAnalyzer, I could only get that document by
>> searching for "foo,". Using StandardAnalyzer or any analyzer that
>> removes punctuation, I could only find it by searching for "foo".
>>
>> I want an analyzer that will allow me to find it if I build a phrase
>> query with the term "foo" followed immediately by ",". After all, the
>> comma may be relevant to the search, but is definitely not part of the
>> word.
>>
>> Extending StandardAnalyzer is what I had in mind, but I don't know where
>> to start. I also wonder why no one seems to have done it before; it
>> makes me suspect that there's some reason I haven't seen yet that makes
>> it impossible or impractical.
>>
>>
>>
>> Karl Wettin wrote:
>>     
>>> On 1 Oct 2007 at 15.33, John Byrne wrote:
>>>
>>>       
>>>> Has anyone written an analyzer that preserves punctuation and
>>>> symbols ("£", "$", "%" etc.) as tokens?
>>>>         
>>> WhitespaceAnalyzer?
>>>
>>> You could also extend the lexical rules of StandardAnalyzer.
>>>
>>>
>>>       




Re: Indexing punctuation and symbols

Posted by Patrick Turcotte <pa...@gmail.com>.
Hi,

I don't know the size of your dataset, but couldn't you index into 2
fields with PerFieldAnalyzer, tokenizing with StandardAnalyzer for one
field and WhitespaceAnalyzer for the other?

Then use a multi-field query (there is a query parser for that, I just
don't remember the name right now).
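The two tokenizations this suggestion relies on can be illustrated in plain Java (a rough approximation of what whitespace-style and standard-style analysis would each produce for the same field text; no Lucene dependency, and the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TwoFieldSketch {

    // Whitespace-style tokens: split on whitespace only,
    // punctuation stays attached to the words.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Standard-style tokens (roughly): keep only letter/digit runs, lowercased.
    static List<String> standardTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : text.split("[^\\p{L}\\p{N}]+")) {
            if (!run.isEmpty()) tokens.add(run.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Price: $100, up 5%";
        System.out.println(whitespaceTokens(text)); // [Price:, $100,, up, 5%]
        System.out.println(standardTokens(text));   // [price, 100, up, 5]
    }
}
```

Indexing both token streams under separate fields lets one query match exact punctuated forms while the other matches the plain words.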

Patrick

On 10/1/07, John Byrne <jo...@propylon.com> wrote:
> Whitespace analyzer does preserve those symbols, but not as tokens. It
> simply leaves them attached to the original term.
>
> As an example of what I'm talking about, consider a document that
> contains (without the quotes) "foo, ".
>
> Now, using WhitespaceAnalyzer, I could only get that document by
> searching for "foo,". Using StandardAnalyzer or any analyzer that
> removes punctuation, I could only find it by searching for "foo".
>
> I want an analyzer that will allow me to find it if I build a phrase
> query with the term "foo" followed immediately by ",". After all, the
> comma may be relevant to the search, but is definitely not part of the
> word.
>
> Extending StandardAnalyzer is what I had in mind, but I don't know where
> to start. I also wonder why no one seems to have done it before; it
> makes me suspect that there's some reason I haven't seen yet that makes
> it impossible or impractical.
>
>
>
> Karl Wettin wrote:
> >
> > On 1 Oct 2007 at 15.33, John Byrne wrote:
> >
> >> Has anyone written an analyzer that preserves punctuation and
> >> symbols ("£", "$", "%" etc.) as tokens?
> >
> > WhitespaceAnalyzer?
> >
> > You could also extend the lexical rules of StandardAnalyzer.
> >
> >
>
>



Re: Indexing punctuation and symbols

Posted by John Byrne <jo...@propylon.com>.
Whitespace analyzer does preserve those symbols, but not as tokens. It 
simply leaves them attached to the original term.

As an example of what I'm talking about, consider a document that 
contains (without the quotes) "foo, ".

Now, using WhitespaceAnalyzer, I could only get that document by 
searching for "foo,". Using StandardAnalyzer or any analyzer that 
removes punctuation, I could only find it by searching for "foo".

I want an analyzer that will allow me to find it if I build a phrase 
query with the term "foo" followed immediately by ",". After all, the 
comma may be relevant to the search, but is definitely not part of the 
word.

Extending StandardAnalyzer is what I had in mind, but I don't know where 
to start. I also wonder why no one seems to have done it before; it 
makes me suspect that there's some reason I haven't seen yet that makes 
it impossible or impractical.



Karl Wettin wrote:
>
> On 1 Oct 2007 at 15.33, John Byrne wrote:
>
>> Has anyone written an analyzer that preserves punctuation and
>> symbols ("£", "$", "%" etc.) as tokens?
>
> WhitespaceAnalyzer?
>
> You could also extend the lexical rules of StandardAnalyzer.
>
>




Re: Indexing punctuation and symbols

Posted by Karl Wettin <ka...@gmail.com>.
On 1 Oct 2007 at 15.33, John Byrne wrote:

> Has anyone written an analyzer that preserves punctuation and
> symbols ("£", "$", "%" etc.) as tokens?

WhitespaceAnalyzer?

You could also extend the lexical rules of StandardAnalyzer.


-- 
karl