You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by manjula wijewickrema <ma...@gmail.com> on 2010/12/02 09:01:24 UTC
Re: Analyzer

Dear Erick,

Thanx for your information.

Manjula.

On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson <er...@gmail.com>wrote:

> WhitespaceAnalyzer does just that, splits the incoming stream on
> white space.
>
> From the javadocs for StandardAnalyzer:
>
> A grammar-based tokenizer constructed with JFlex
>
> This should be a good tokenizer for most European-language documents:
>
>   - Splits words at punctuation characters, removing punctuation. However,
>   a dot that's not followed by whitespace is considered part of a token.
>   - Splits words at hyphens, unless there's a number in the token, in which
>   case the whole token is interpreted as a product number and is not split.
>   - Recognizes email addresses and internet hostnames as one token.
>
> Many applications have specific tokenizer needs. If this tokenizer does not
> suit your application, please consider copying this source code directory
> to
> your project and maintaining your own grammar-based tokenizer.
>
>
> Best
>
> Erick
>
> On Tue, Nov 30, 2010 at 12:06 AM, manjula wijewickrema
> <ma...@gmail.com>wrote:
>
> > Hi Steve,
> >
> > Thanx a lot for your reply. Yes there are only two classes and it's
> corrcet
> > that the way you have realized the problem. As you have instructed, I
> > checked WhitespaceAnalyzer for querying (instead of StandardAnalyzer) and
> > it
> > seems to me that it gives better results rather than StandardAnalyzer. So
> > could you please let me know what are the differences between
> > StandardAnalyzer and WhitespaceAnalyzer. I highly appriciate your
> response.
> > Thanx.
> >
> > Manjula.
> >
> >
> > On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe <sa...@syr.edu> wrote:
> >
> > > Hi Manjula,
> > >
> > > It's not terribly clear what you're doing here - I got lost in your
> > > description of your (two? or maybe four?) classes.  Sometimes things
> are
> > > easier to understand if you provide more concrete detail.
> > >
> > > I suspect that you could benefit from reading the book Lucene in
> Action,
> > > 2nd edition:
> > >
> > >   http://www.manning.com/hatcher3/
> > >
> > > You would also likely benefit from using Luke, the Lucene index
> browser,
> > to
> > > better understand your indexes' contents and debug how queries match
> > > documents:
> > >
> > >   http://code.google.com/p/luke/
> > >
> > > I think your question is whether you're using Analyzers correctly.  It
> > > sounds like you are creating two separate indexes (one for each of your
> > > classes), and you're using SnowballAnalyzer on the indexing side for
> both
> > > indexes, and StandardAnalyzer on the query side.
> > >
> > > The usual advice is to use the same Analyzer on both the query and the
> > > index side.  But it appears to be the case that you are taking stemmed
> > index
> > > terms from your index #1 and then querying index #2 using these stemmed
> > > terms.  If this is true, then you want the query-time analyzer in your
> > > second index not to change the query terms.  You'll likely get better
> > > results using WhitespaceAnalyzer, which tokenizes on whitespace and
> does
> > no
> > > further analysis, rather than StandardAnalyzer.
> > >
> > > Steve
> > >
> > > > -----Original Message-----
> > > > From: manjula wijewickrema [mailto:manjula53@gmail.com]
> > > > Sent: Monday, November 29, 2010 4:32 AM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Analyzer
> > > >
> > > > Hi,
> > > >
> > > > In my work, I am using Lucene and two java classes. In the first one,
> I
> > > > index a document and in the second one, I try to search the most
> > relevant
> > > > document for the indexed document in the first one. In the first java
> > > > class,
> > > > I use the SnowballAnalyzer in the createIndex method and
> > StandardAnalyzer
> > > > in
> > > > the searchIndex method and pass the highest frequency terms into the
> > > > second
> > > > Java class. In the second class, I use SnowballAnalyzer in the
> > > createIndex
> > > > method (this index is for the collection of documents to be searched,
> > or
> > > > it
> > > > is my database) and StandardAnalyser in the searchIndex method (I
> pass
> > > the
> > > > highest frequently occuring term of the first class as the search
> term
> > > > parameter to the searchIndex method of the second class). Using
> > Analyzers
> > > > in
> > > > this manner, what I am willing is to do the stemming, stop-words in
> > both
> > > > indexes (in both classes) and to search those a few high frequency
> > words
> > > > (of
> > > > the first index) in the second index. So, if my intention is clear to
> > > you,
> > > > could you please let me know whether it is correct or not the way I
> > have
> > > > used Analyzers? I highly appreciate any comment.
> > > >
> > > > Thanx.
> > > > Manjula.
> > >
> >
>