Posted to java-user@lucene.apache.org by Xavier To <to...@courrier.uqam.ca> on 2007/02/09 14:29:26 UTC

Re : Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers

Hey, thanks a lot for taking so much time here...

I did check the analyzers and they appear to be the same; at least they are the same class and the same package. I just noticed something: they are using LowerCaseFilter. I was going to ask "could it be the source of the numbers being ignored?" but it shouldn't be, since the numbers are indexed (switching to WhitespaceAnalyzer during the search did return the exact number of results for "2002", which is 5).

As for the tokenizing, shouldn't a query be tokenized? It was already like that, and all I did was modify it so it would use Lucene's tokenizing methods. If a query shouldn't be tokenized, maybe tokenizing it is the problem. If it should be tokenized, what am I doing wrong that forces me to add a single blank after each token? I mean, I don't understand what the analyzer has to do with the tokenizing process. The reason I add a blank is that the tokens are appended into a string, and then the string is sent through QueryParser.
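
As an illustration of the step described above (tokens appended into one string before it is handed to QueryParser), here is a minimal plain-Java sketch with no Lucene dependency. The class and method names are hypothetical, `tokenize` only approximates what a whitespace tokenizer does, and `String.join` assumes Java 8 or later. It shows why appending tokens directly yields "thisisatest" and how joining with a single separator avoids special-casing the last token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class TokenJoinSketch {

    // Splits on runs of whitespace, roughly like a whitespace tokenizer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Rebuilds a query string with exactly one space between tokens.
    static String rebuild(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("this  is a   test");
        // Appending tokens with no separator loses the word boundaries.
        System.out.println(String.join("", tokens)); // thisisatest
        // Joining with a single space restores them.
        System.out.println(rebuild(tokens));         // this is a test
    }
}
```

If the rebuilt string is only ever passed back to a query parser, this round trip may be unnecessary, since the parser applies its own analyzer to the raw query text.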

As I said, I don't really understand why the guy who made this search engine didn't just send the query as one long string instead of tokenizing it, but since it was working fine with alphabetical searches, I said to myself "it must be the way to do it".

Xavier Tô
Bacc. en Informatique et Génie Logiciel
to.xavier@courrier.uqam.ca
(450)434-8905

----- Original Message -----
From: Erick Erickson <er...@gmail.com>
Date: Thursday, February 8, 2007 5:13 pm
Subject: Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers

> See below....
>
> On 2/8/07, Xavier To <to...@courrier.uqam.ca> wrote:
> >
> > Thanks for helping me.
> >
> > I don't really understand what you mean by my Tokenizer "corrects" what
> > the indexing analyzer did.
>
> You shouldn't have to change the tokens in the usual case to get the
> search to work right. You mentioned tokenizing the search string, but then
> having to add whitespaces back in. That step is the step that "corrects"
> what the analyzer did. I put "corrects" in quotes because it isn't really
> correcting anything; the analyzers are doing what they should. But if you
> have to make this manual change, you're trying to fix up the query string to
> match what the analyzer did at index time. Which will leave you correcting
> this, then that, then the other thing when it would be much better just to
> use the same analyzer if possible. I've just seen too many "oh, there's one
> more thing" statements in this situation.
>
> > By the way, the tokenizer we use is one provided in Lucene. My guess is that
> > the problem was that the analyzer was thought to be the same by the guy who
> > made the search engine, but the querying analyzer is fetched inside a JAR by
> > a bean. Could it be that this is the problem?
>
> It shouldn't be if the same analyzer is fetched inside the bean. Can't you
> check what analyzer is used in both cases?
>
> Erick
>
> > Xavier Tô
> > Bacc. en Informatique et Génie Logiciel
> > to.xavier@courrier.uqam.ca
> > (450)434-8905
> >
> > ----- Original Message -----
> > From: Erick Erickson <er...@gmail.com>
> > Date: Thursday, February 8, 2007 12:51 pm
> > Subject: Re: Re : Re: Re : Re: Question concerning Analyzers
> >
> > > Well, you've proved that your problem is that the analyzer you're using when
> > > querying isn't matching what you use during indexing. I think that what
> > > you've done will lead you into significant problems down the road as your
> > > tokenizer then has to "correct" for what the index analyzer did though.
> > > What would probably be MUCH less work in the long run is to align the
> > > analyzer you use at query time with the analyzer you use at index time. You
> > > can use a PerFieldAnalyzerWrapper to handle different fields in different
> > > ways. Forget your custom tokenizer for the time being, just try using the
> > > same analyzer during searching that you used during indexing. You can use
> > > the
> > >
> > > QueryParser(String f, Analyzer a)
> > >
> > > form of the QueryParser, where the Analyzer is the same one you used when
> > > indexing. There are some circumstances where you want to use different
> > > analyzers when querying and when indexing, but don't go there unless you
> > > need to <G>....
> > >
> > > If that doesn't do what you want, what I'd really recommend is that you
> > > make your own custom Analyzer, built on, say, WhitespaceTokenizer and
> > > LowerCaseFilter. This is usually the way I've approached this kind of
> > > problem. And use *that* one at index and query time.
> > >
> > > There's an example in Lucene In Action, see the SynonymAnalyzer example.
> > > That example is MUCH more complex than you'll need <G>...
> > >
> > > Best
> > > Erick
> > >
> > > On 2/8/07, Xavier To <to...@courrier.uqam.ca> wrote:
> > > >
> > > > Hey!
> > > >
> > > > I tried using WhitespaceAnalyzer during the search and it works. I
> > > > refactored the tokenizing process so it uses TokenStream instead of
> > > > StringTokenizer and it works fine except for one thing: the query
> > > > "this is a test" becomes "thisisatest". I fixed it by adding a space
> > > > after each token except for the last one, but is there a clean way to
> > > > do it? I'm using WhitespaceTokenizer.
> > > >
> > > > Thanks a bunch!
> > > >
> > > > Xavier Tô
> > > > Bacc. en Informatique et Génie Logiciel
> > > > to.xavier@courrier.uqam.ca
> > > > (450)434-8905
> > > >
> > > > ----- Original Message -----
> > > > From: Erick Erickson <er...@gmail.com>
> > > > Date: Wednesday, February 7, 2007 4:28 pm
> > > > Subject: Re: Re : Re: Question concerning Analyzers
> > > >
> > > > > Then the analyzer you're using when parsing the query is stripping them. It
> > > > > must be different than the one you use when indexing somehow. At least
> > > > > that's the only explanation I can imagine....
> > > > >
> > > > > Perhaps, somehow, you are using a default analyzer when you parse a
> > > > > query? Or you aren't specifying the field when you query and thus a
> > > > > default is used? Or you are using a PerFieldAnalyzerWrapper and dropping
> > > > > through to the default? or ????
> > > > >
> > > > > Just for yucks, I'd try using WhitespaceAnalyzer on a query with something
> > > > > you *know* exists in the index for a particular field and work my way up to
> > > > > whatever your real problem is in small steps (since you can't post
> > > > > code <G>)......
> > > > >
> > > > > Best
> > > > > Erick
> > > > >
> > > > > On 2/7/07, Xavier To <to...@courrier.uqam.ca> wrote:
> > > > > >
> > > > > > Thanks Erik and Erick,
> > > > > >
> > > > > > I guess my question was rather unclear, but you guys answered it all the
> > > > > > same: it is impossible for an analyzer to index something and have the
> > > > > > same analyzer ignore the thing indexed during a search.
> > > > > >
> > > > > > If it makes everything clearer, during indexing, numbers are indexed,
> > > > > > whether or not they are accompanied by letters (2003 and 4wd are both
> > > > > > indexed). That's fine, since we want this. The problem occurs when I try to
> > > > > > search for them: they are ignored. I know they are indexed because I ran
> > > > > > through the index using Luke.
> > > > > >
> > > > > > Any thoughts regarding this problem?
> > > > > >
> > > > > > Xavier Tô
> > > > > > Bacc. en Informatique et Génie Logiciel
> > > > > > to.xavier@courrier.uqam.ca
> > > > > > (450)434-8905
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Erik Hatcher <er...@ehatchersolutions.com>
> > > > > > Date: Wednesday, February 7, 2007 3:15 pm
> > > > > > Subject: Re: Question concerning Analyzers
> > > > > >
> > > > > > > There is no requirement that you use the same analyzer to search as
> > > > > > > you used to index. So, yes, you could certainly index things and
> > > > > > > ignore them during a search.
> > > > > > >
> > > > > > >       Erik
> > > > > > >
> > > > > > >
> > > > > > > On Feb 7, 2007, at 2:10 PM, Xavier To wrote:
> > > > > > >
> > > > > > > > Hi, me again
> > > > > > > >
> > > > > > > > I'm still stuck with my search engine, but something popped in my
> > > > > > > > head: Can an analyzer index something but ignore it during a
> > > > > > > > search? I'm asking this because now that I've been searching for
> > > > > > > > an answer, I've come to think that I should redo the whole search
> > > > > > > > engine, but I don't want to reproduce the same error as we have
> > > > > > > > now. It would be stupid to accidentally redo the same mistake. I
> > > > > > > > still haven't received news from my seniors about me posting code
> > > > > > > > and all...
> > > > > > > >
> > > > > > > > Xavier Tô
> > > > > > > > Bacc. en Informatique et Génie Logiciel
> > > > > > > > to.xavier@courrier.uqam.ca
> > > > > > > > (450)434-8905
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Re : Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers

Posted by Erick Erickson <er...@gmail.com>.

The query should be tokenized *by the query parser*. You shouldn't have to
do the tokenizing yourself. When you print out the results of the parsing,
you should see something like field:value1 field:value2, which are built up
under the covers to be a BooleanQuery with a bunch of clauses.
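
To make the point above concrete without depending on a Lucene jar, here is a toy plain-Java sketch. Everything in it is an assumption for illustration: the two "analyzers" are hypothetical stand-ins (one keeps digits, one keeps only letters), the index is a simple map from term to document ids, and the field name `contents` is made up. It imitates the `field:value1 field:value2` form of a parsed query and shows how a query-side analyzer that drops digits makes an indexed term such as "2002" unfindable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only; none of these names are Lucene classes.
class AnalyzerMismatchSketch {
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Toy analyzer: lowercases and splits on whitespace, so digits survive.
    static List<String> whitespaceLower(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Toy analyzer: keeps only runs of letters, so "2002" yields no token.
    static List<String> lettersOnly(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LETTERS.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // Builds a term -> document ids map using the given analyzer.
    static Map<String, Set<Integer>> index(List<String> docs,
                                           Function<String, List<String>> analyzer) {
        Map<String, Set<Integer>> postings = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzer.apply(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Imitates the printed form of a parsed query: one clause per token.
    static String toClauses(String field, String query,
                            Function<String, List<String>> analyzer) {
        List<String> clauses = new ArrayList<>();
        for (String term : analyzer.apply(query)) clauses.add(field + ":" + term);
        return String.join(" ", clauses);
    }

    // Returns documents matching any query term (OR semantics).
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query,
                               Function<String, List<String>> analyzer) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyzer.apply(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Ford Explorer 2002 4wd", "Honda Civic 2003", "Ford Focus");
        Map<String, Set<Integer>> postings =
                index(docs, AnalyzerMismatchSketch::whitespaceLower);

        // Same analyzer on both sides: the number is found.
        System.out.println(toClauses("contents", "Ford 2002",
                AnalyzerMismatchSketch::whitespaceLower)); // contents:ford contents:2002
        System.out.println(search(postings, "2002",
                AnalyzerMismatchSketch::whitespaceLower)); // [0]

        // Different analyzer at query time: the number is silently dropped.
        System.out.println(toClauses("contents", "Ford 2002",
                AnalyzerMismatchSketch::lettersOnly));     // contents:ford
        System.out.println(search(postings, "2002",
                AnalyzerMismatchSketch::lettersOnly));     // []
    }
}
```

Printing the clause string for a failing query is a cheap first check: if a term is missing from it, the query-side analysis removed that term before the index was ever consulted.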

I think, though, I'm really at the end of any helpful suggestions I can come
up with without looking at some code from both the indexing and querying.
Otherwise, we'll just continue to mislead each other. If you haven't
already, I strongly urge you to get a copy of Lucene In Action since that'll
give you a much more thorough explication of tokenizing than I can.

Best
Erick
