Posted to java-user@lucene.apache.org by manjula wijewickrema <ma...@gmail.com> on 2010/11/29 10:31:49 UTC

Analyzer

Hi,

In my work, I am using Lucene with two Java classes. In the first class, I
index a document; in the second, I search for the document most relevant to
the one indexed in the first. The first class uses SnowballAnalyzer in its
createIndex method and StandardAnalyzer in its searchIndex method, and it
passes its highest-frequency terms to the second class. The second class
uses SnowballAnalyzer in its createIndex method (this index covers the
collection of documents to be searched, i.e. my database) and
StandardAnalyzer in its searchIndex method (I pass the most frequently
occurring term from the first class as the search term parameter to the
searchIndex method of the second class). By using the Analyzers this way, I
want to apply stemming and stop-word removal in both indexes (in both
classes) and then search for those few high-frequency words (from the first
index) in the second index. If my intention is clear to you, could you
please let me know whether this is the correct way to use the Analyzers? I
would highly appreciate any comments.

Thanx.
Manjula.

Re: Analyzer

Posted by manjula wijewickrema <ma...@gmail.com>.
Dear Erick,

Thanx for your information.

Manjula.

In reply to Erick Erickson <er...@gmail.com> (Tue, Nov 30, 2010 at 6:37 PM).

Re: Analyzer

Posted by Erick Erickson <er...@gmail.com>.
WhitespaceAnalyzer does just that: it splits the incoming stream on
whitespace.

From the javadocs for StandardTokenizer, the tokenizer that StandardAnalyzer
is built on:

A grammar-based tokenizer constructed with JFlex

This should be a good tokenizer for most European-language documents:

   - Splits words at punctuation characters, removing punctuation. However,
   a dot that's not followed by whitespace is considered part of a token.
   - Splits words at hyphens, unless there's a number in the token, in which
   case the whole token is interpreted as a product number and is not split.
   - Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not
suit your application, please consider copying this source code directory to
your project and maintaining your own grammar-based tokenizer.
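
To make the difference concrete, here is a rough sketch in plain Java. It
does not use Lucene: the class AnalyzerSketch, its two methods, and the tiny
stop-word list are all made up for illustration. The "standard-like" method
only approximates the rules quoted above by splitting on every character that
is not a letter or digit; it also lowercases tokens and drops stop words,
which StandardAnalyzer does in addition to tokenizing. The real grammar
treats dots, product numbers, email addresses and hostnames differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-ins for illustration only. These are NOT Lucene classes
// and do not reproduce the JFlex grammar of the real tokenizer.
class AnalyzerSketch {

    // Hypothetical, tiny stop-word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    // Whitespace-style analysis: split on runs of whitespace, change nothing else.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Rough "standard-like" analysis: split on anything that is not a letter
    // or digit, lowercase each token, and drop stop words.
    static List<String> standardLikeTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox, and the dog";
        System.out.println(whitespaceTokens(text));   // case and punctuation kept
        System.out.println(standardLikeTokens(text)); // lowercased, stop words gone
    }
}
```

For the sample sentence, the whitespace version keeps "The", "Quick-Brown"
and "fox," exactly as typed, while the standard-like version yields quick,
brown, fox, dog.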


Best

Erick

In reply to manjula wijewickrema <ma...@gmail.com> (Tue, Nov 30, 2010 at 12:06 AM).

Re: Analyzer

Posted by manjula wijewickrema <ma...@gmail.com>.
Hi Steve,

Thanx a lot for your reply. Yes, there are only two classes, and you have
understood the problem correctly. As you suggested, I tried
WhitespaceAnalyzer for querying (instead of StandardAnalyzer), and it seems
to give better results than StandardAnalyzer. Could you please let me know
what the differences between StandardAnalyzer and WhitespaceAnalyzer are? I
would highly appreciate your response.
Thanx.

Manjula.


In reply to Steven A Rowe <sa...@syr.edu> (Mon, Nov 29, 2010 at 7:32 PM).

RE: Analyzer

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Manjula,

It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes.  Sometimes things are easier to understand if you provide more concrete detail.

I suspect that you could benefit from reading the book Lucene in Action, 2nd edition: 

   http://www.manning.com/hatcher3/

You would also likely benefit from using Luke, the Lucene index browser, to better understand your indexes' contents and debug how queries match documents: 

   http://code.google.com/p/luke/

I think your question is whether you're using Analyzers correctly.  It sounds like you are creating two separate indexes (one for each of your classes), and you're using SnowballAnalyzer on the indexing side for both indexes, and StandardAnalyzer on the query side.  

The usual advice is to use the same Analyzer on both the query and the index side.  But it appears to be the case that you are taking stemmed index terms from your index #1 and then querying index #2 using these stemmed terms.  If this is true, then you want the query-time analyzer in your second index not to change the query terms.  You'll likely get better results using WhitespaceAnalyzer, which tokenizes on whitespace and does no further analysis, rather than StandardAnalyzer.
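
A rough sketch of that reasoning in plain Java, without Lucene. Everything here is hypothetical: the class StemmedQuerySketch, its methods, the sample terms and the document ids are invented for illustration, and the two maps merely stand in for the term statistics of index #1 and the postings of index #2. The point it shows is that a term already stemmed at index time matches index #2 only if it is looked up unchanged, while an unstemmed word does not match.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: plain maps stand in for the two indexes.
class StemmedQuerySketch {

    // Returns the most frequent term (ties broken alphabetically),
    // or null if the map is empty. Frequencies are assumed non-negative.
    static String topTerm(Map<String, Integer> termFreqs) {
        String best = null;
        int bestFreq = -1;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int f = e.getValue();
            if (f > bestFreq || (f == bestFreq && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestFreq = f;
            }
        }
        return best;
    }

    // Exact-term lookup: the query term is used as-is, with no further analysis.
    static Set<String> docsFor(Map<String, Set<String>> index, String term) {
        Set<String> docs = index.get(term);
        return docs == null ? Collections.<String>emptySet() : docs;
    }

    public static void main(String[] args) {
        // Stemmed terms and their frequencies from index #1 (made-up data).
        Map<String, Integer> index1Freqs = new HashMap<>();
        index1Freqs.put("comput", 5);
        index1Freqs.put("index", 3);
        index1Freqs.put("search", 2);

        // Stemmed terms and the documents containing them in index #2 (made-up data).
        Map<String, Set<String>> index2 = new HashMap<>();
        index2.put("comput", new TreeSet<>(Arrays.asList("doc2", "doc7")));
        index2.put("librari", new TreeSet<>(Arrays.asList("doc4")));

        // The stemmed term, passed through unchanged, finds its postings.
        String queryTerm = topTerm(index1Freqs);
        System.out.println(queryTerm + " -> " + docsFor(index2, queryTerm));

        // An unstemmed word finds nothing, because index #2 holds only stems.
        System.out.println("computers -> " + docsFor(index2, "computers"));
    }
}
```

In real code the exact lookup would correspond to querying with an analyzer that leaves the term alone, which is the role WhitespaceAnalyzer plays in the advice above.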

Steve
