Posted to java-user@lucene.apache.org by Xavier To <to...@courrier.uqam.ca> on 2007/02/09 19:33:07 UTC

Re : Re: Re : Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers

Thanks a lot for all your help. I guess this temporary fix will have to do until I have clearance to post some code. For the current index (that was last modified over a year ago), it works fine, but I know it's not properly done.

Thank you all very much, especially you Mr Erickson.

Xavier Tô
Bachelor's in Computer Science and Software Engineering
to.xavier@courrier.uqam.ca
(450)434-8905

----- Original Message -----
From: Erick Erickson <er...@gmail.com>
Date: Friday, February 9, 2007 12:38 pm
Subject: Re: Re : Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers

> The query should be tokenized *by the query parser*. You shouldn't have to
> do the tokenizing yourself. When you print out the results of the parsing,
> you should see something like field:value1 field:value2, which are built up
> under the covers to be a BooleanQuery with a bunch of clauses.
> 
> I think, though, I'm really at the end of any helpful suggestions I can come
> up with without looking at some code from both the indexing and querying.
> Otherwise, we'll just continue to mislead each other. If you haven't
> already, I strongly urge you to get a copy of Lucene In Action since that'll
> give you a much more thorough explication of tokenizing than I can.
> 
> Best
> Erick
> 
> On 2/9/07, Xavier To <to...@courrier.uqam.ca> wrote:
> >
> > Hey, thanks a lot for taking so much time here...
> >
> > I did check the analyzers and they appear to be the same... at least they
> > are the same class and same package. I just noticed something: they are using
> > LowerCaseFilter... I was going to say "could it be the source of the
> > numbers being ignored?" but it shouldn't be, since they are indexed (the
> > modification of using WhitespaceAnalyzer during the search did return the
> > exact number of results for "2002", which is 5).
> >
> > As for the tokenizing, shouldn't a query be tokenized? It was already
> > like that, and all I did was modify it so it would use Lucene's tokenizing
> > methods... If a query shouldn't be tokenized, maybe tokenizing it is the
> > problem. If it should be tokenized, what am I doing wrong that forces me to
> > add a single blank after each token? I mean, I don't understand what the
> > analyzer has to do with the tokenizing process... The reason why I add a
> > blank is because the tokens are getting appended into a string, and then the
> > string is sent through QueryParser.
> >
> > As I said, I don't really understand why the guy who made this search
> > engine didn't just send the query as a long string instead of tokenizing it,
> > but since it was working fine with alphabetical searches, I said to myself
> > "it must be the way to do it".
> >
> > Xavier Tô
> > Bachelor's in Computer Science and Software Engineering
> > to.xavier@courrier.uqam.ca
> > (450)434-8905
> >
> > ----- Original Message -----
> > From: Erick Erickson <er...@gmail.com>
> > Date: Thursday, February 8, 2007 5:13 pm
> > Subject: Re: Re : Re: Re : Re: Re : Re: Question concerning Analyzers
> >
> > > See below....
> > >
> > > On 2/8/07, Xavier To <to...@courrier.uqam.ca> wrote:
> > > >
> > > > Thanks for helping me.
> > > >
> > > > I don't really understand what you mean by my Tokenizer "corrects" what
> > > > the indexing analyzer did.
> > >
> > >
> > > You shouldn't have to change the tokens in the usual case to get the
> > > search to work right. You mentioned tokenizing the search string, but then
> > > having to add whitespaces back in. That step is the step that "corrects"
> > > what the analyzer did. I put "corrects" in quotes because it isn't really
> > > correcting anything, the analyzers are doing what they should. But if you
> > > have to make this manual change, you're trying to fix up the query string to
> > > match what the analyzer did at index time. Which will leave you correcting
> > > this, then that, then the other thing when it would be much better just to
> > > use the same analyzer if possible. I've just seen too many "oh, there's one
> > > more thing" statements in this situation.
> > >
> > >
> > > > By the way, the tokenizer we use is one provided in Lucene. My guess is that
> > > > the problem was that the analyzer was thought to be the same by the guy who
> > > > made the search engine, but the querying analyzer is fetched inside a JAR by
> > > > a bean. Could it be that this is the problem?
> > >
> > >
> > > It shouldn't be if the same analyzer is fetched inside the bean. Can't you
> > > check what analyzer is used in both cases?
> > >
> > > Erick
> > >
> > >
> > > > Xavier Tô
> > > > Bachelor's in Computer Science and Software Engineering
> > > > to.xavier@courrier.uqam.ca
> > > > (450)434-8905
> > > >
> > > > ----- Original Message -----
> > > > From: Erick Erickson <er...@gmail.com>
> > > > Date: Thursday, February 8, 2007 12:51 pm
> > > > Subject: Re: Re : Re: Re : Re: Question concerning Analyzers
> > > >
> > > > > Well, you've proved that your problem is that the analyzer you're using when
> > > > > querying isn't matching what you use during indexing. I think that what
> > > > > you've done will lead you into significant problems down the road, though, as
> > > > > your tokenizer then has to "correct" for what the index analyzer did.
> > > > > What would probably be MUCH less work in the long run is to align the
> > > > > analyzer you use at query time with the analyzer you use at index time. You
> > > > > can use a PerFieldAnalyzerWrapper to handle different fields in different
> > > > > ways. Forget your custom tokenizer for the time being, just try using the
> > > > > same analyzer during searching that you used during indexing. You can use the
> > > > >
> > > > > QueryParser(String f, Analyzer a)
> > > > >
> > > > > form of the QueryParser, where the Analyzer is the same one you used when
> > > > > indexing. There are some circumstances where you want to use different
> > > > > analyzers when querying and when indexing, but don't go there unless you
> > > > > need to <G>....
> > > > >
> > > > > If that doesn't do what you want, what I'd really recommend is that you
> > > > > make your own custom Analyzer, built on, say, WhitespaceTokenizer and
> > > > > LowerCaseFilter. This is usually the way I've approached this kind
> > > > > of problem. And use *that* one at index and query time.
> > > > >
> > > > > There's an example in Lucene In Action, see the SynonymAnalyzer
> > > > > example. That example is MUCH more complex than you'll need <G>...
> > > > >
> > > > > Best
> > > > > Erick
> > > > >
> > > > > On 2/8/07, Xavier To <to...@courrier.uqam.ca> wrote:
> > > > > >
> > > > > > Hey !
> > > > > >
> > > > > > I tried using WhitespaceAnalyzer during the search and it works. I
> > > > > > refactored the tokenizing process so it uses TokenStream instead of
> > > > > > StringTokenizer and it works fine except for one thing: the query
> > > > > > "this is a test" becomes "thisisatest". I fixed it by adding a space
> > > > > > after each token except for the last one, but is there a clean way to
> > > > > > do it? I'm using WhitespaceTokenizer.
> > > > > >
> > > > > > Thanks a bunch !
> > > > > >
> > > > > > Xavier Tô
> > > > > > Bachelor's in Computer Science and Software Engineering
> > > > > > to.xavier@courrier.uqam.ca
> > > > > > (450)434-8905
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Erick Erickson <er...@gmail.com>
> > > > > > Date: Wednesday, February 7, 2007 4:28 pm
> > > > > > Subject: Re: Re : Re: Question concerning Analyzers
> > > > > >
> > > > > > > Then the analyzer you're using when parsing the query is stripping them. It
> > > > > > > must be different than the one you use when indexing somehow. At least
> > > > > > > that's the only explanation I can imagine....
> > > > > > >
> > > > > > > Perhaps, somehow, you are using a default analyzer when you parse a
> > > > > > > query? Or you aren't specifying the field when you query and thus a
> > > > > > > default is used? Or you are using a PerFieldAnalyzerWrapper and dropping
> > > > > > > through to the default? or ????
> > > > > > >
> > > > > > > Just for yucks, I'd try using WhitespaceAnalyzer on a query with
> > > > > > > something you *know* exists in the index for a particular field and
> > > > > > > work my way up to whatever your real problem is in small steps (since you
> > > > > > > can't post code <G>)......
> > > > > > >
> > > > > > > Best
> > > > > > > Erick
> > > > > > >
> > > > > > > On 2/7/07, Xavier To <to...@courrier.uqam.ca> wrote:
> > > > > > > >
> > > > > > > > Thanks Erik and Erick,
> > > > > > > >
> > > > > > > > I guess my question was rather unclear, but you guys answered it all the
> > > > > > > > same: it is impossible for an analyzer to index something and have the
> > > > > > > > same analyzer ignore the thing indexed during a search.
> > > > > > > >
> > > > > > > > If it makes everything clearer, during indexing, numbers are indexed,
> > > > > > > > whether or not they are accompanied by letters (2003 and 4wd are both
> > > > > > > > indexed). That's fine, since we want this. The problem occurs when I try to
> > > > > > > > search for them: they are ignored. I know they are indexed because I ran
> > > > > > > > through the index using Luke.
> > > > > > > >
> > > > > > > > Any thoughts regarding this problem ?
> > > > > > > >
> > > > > > > > Xavier Tô
> > > > > > > > Bachelor's in Computer Science and Software Engineering
> > > > > > > > to.xavier@courrier.uqam.ca
> > > > > > > > (450)434-8905
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Erik Hatcher <er...@ehatchersolutions.com>
> > > > > > > > Date: Wednesday, February 7, 2007 3:15 pm
> > > > > > > > Subject: Re: Question concerning Analyzers
> > > > > > > >
> > > > > > > > > There is no requirement that you use the same analyzer to search as
> > > > > > > > > you used to index.  So, yes, you could certainly index things and
> > > > > > > > > ignore them during a search.
> > > > > > > > >
> > > > > > > > >       Erik
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Feb 7, 2007, at 2:10 PM, Xavier To wrote:
> > > > > > > > >
> > > > > > > > > > Hi, me again
> > > > > > > > > >
> > > > > > > > > > I'm still stuck with my search engine, but something popped in my
> > > > > > > > > > head: Can an analyzer index something but ignore it during a
> > > > > > > > > > search? I'm asking this because now that I've been searching for
> > > > > > > > > > an answer, I've come to think that I should redo the whole search
> > > > > > > > > > engine, but I don't want to reproduce the same error as we have
> > > > > > > > > > now. It would be stupid to accidentally redo the same mistake. I
> > > > > > > > > > still haven't received news from my seniors about me posting code
> > > > > > > > > > and all...
> > > > > > > > > >
> > > > > > > > > > Xavier Tô
> > > > > > > > > > Bachelor's in Computer Science and Software Engineering
> > > > > > > > > > to.xavier@courrier.uqam.ca
> > > > > > > > > > (450)434-8905
> > > > > > > > > >
> 
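
The diagnosis in the thread above (numbers such as "2002" and "4wd" are in the index, yet searching for them finds nothing) is what one would expect when the query-time analyzer drops characters that the index-time analyzer kept. The following is only a minimal sketch of that effect in plain Java. It does not use Lucene at all: whitespaceLowercase and lettersOnlyLowercase are hypothetical stand-ins for "an analyzer that splits on whitespace and lowercases" and "an analyzer that keeps only letters", and the sample text is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only: these methods are NOT Lucene classes or Lucene APIs.
final class AnalyzerMismatchDemo {

    // Hypothetical index-time analysis: split on whitespace, then lowercase.
    static List<String> whitespaceLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical query-time analysis: any non-letter ends a token,
    // so digits never make it into the output.
    static List<String> lettersOnlyLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Jeep 4wd 2003";
        // Terms that would be stored at index time.
        System.out.println(whitespaceLowercase(text));
        // Terms that would be looked up at query time with the other analysis.
        System.out.println(lettersOnlyLowercase(text));
    }
}
```

With the first method the terms "4wd" and "2003" survive; with the second, "2003" disappears and "4wd" is reduced to "wd", so neither can match what was indexed. Using one and the same analysis on both sides, as Erick suggests, removes the mismatch.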

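
On the narrower question raised in the thread of a "clean way" to avoid "thisisatest" when tokens are appended back into one string: Erick's advice is not to re-tokenize at all and to hand the raw query string to the QueryParser with the right analyzer. If the tokens must be re-joined anyway, the sketch below shows one tidy way to do it, assuming the tokens have already been collected into a list and assuming Java 8 or later (String.join did not exist when the thread was written). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, no Lucene types involved.
final class QueryRebuildDemo {

    // Joins already-extracted tokens with exactly one space between them,
    // with no leading or trailing separator.
    static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("this", "is", "a", "test");
        System.out.println(joinTokens(tokens));
    }
}
```

This replaces the "add a space after every token except the last" bookkeeping with a single call, but it does not change the underlying point of the thread: the joined string is still analyzed again by whatever analyzer the QueryParser was given.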

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org