You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Xavier To <to...@courrier.uqam.ca> on 2007/02/05 19:03:51 UTC

Re : Re: Re : Re: Problem with a search engine

Thanks for your help !

Wow, I never expected that many replies. Cool !

I did try to print out the query before and after it gets processed by QueryParser and let say my query is "2003", before and after it will be "2003". If I put "report 2003" the query will be, before and after getting into the parser, "report 2003". Problem is that documents are indexed with "2003" or so Luke says but they are never found by our search engine. So even if I search "report AND 2003" I will only get results for the "report" token and nothing for "2003" even if there are 7 documents found by Luke, using the StandardAnalyzer (which is the one we're using). I would restart a new search engine from scratch, not refactoring Java to Lucene but using Lucene from the start, but it would require so many modification elsewhere....

Xavier Tô
Bacc. en Informatique et Génie Logiciel
to.xavier@courrier.uqam.ca
(450)434-8905

----- Message d'origine -----
De: Erick Erickson <er...@gmail.com>
Date: Lundi, Février 5, 2007 12:37 pm
Objet: Re: Re : Re: Problem with a search engine

> Have you tried looking at the actual query submitted with 
> Query.toString()?That might give you an insight into what is 
> actually being submitted to
> Lucene and a place to start.
> 
> Also be aware that QueryParser, the default operator is OR which 
> can produce
> unexpected results if you assume AND.
> 
> Best
> Erick
> 
> On 2/5/07, Xavier To <to...@courrier.uqam.ca> wrote:
> >
> > Thanks for your help,
> >
> > As I stated before, the numbers, whether pure or not, are 
> indexed, for I
> > can search them with luke. But supposing what you're saying was 
> the case,
> > the search for "10-year" should return 4 items (according to the 
> number of
> > occurence found by luke). Problem is that the number of documents 
> returned> is 6, for it ignored the "10" and searched for "-year".
> >
> > Xavier Tô
> > Bacc. en Informatique et Génie Logiciel
> > to.xavier@courrier.uqam.ca
> > (450)434-8905
> >
> > ----- Message d'origine -----
> > De: Mark Miller <ma...@gmail.com>
> > Date: Lundi, Février 5, 2007 11:11 am
> > Objet: Re: Problem with a search engine
> >
> > > StandardAnalyzer does not index pure numbers. It will index
> > > alphanumerictokens and numbers that are connected with one of:
> > > "_"|"-"|"/"|"."|"," If
> > > you wish to index pure numbers you might want to add another 
> regex to
> > > StandardAnalyzer that recognizes a series of digits - don't forget
> > > to add
> > > the new token type to the grammar lower in the 
> StandardTokenizer.jj> > file.
> > > - Mark
> > >
> > > On 2/5/07, Xavier To <to...@courrier.uqam.ca> wrote:
> > > >
> > > > Thanks for taking time to answer me. The problem is that I'm not
> > > allowed> to post code due to a confidentiality contract that I was
> > > required to sign.
> > > > I'll try to see if I can get a special permission to post code
> > > since I'm
> > > > wasting so much time trying to find the answer to this.
> > > >
> > > > I tried looking for each time the query is touched and numbers
> > > are still
> > > > present in the query. I don't know if it's the analyzer, but if
> > > it was,
> > > > woundl't the numbers be cut out of the index completely ? As I
> > > said in my
> > > > 1st post, they are "findable" with Lukeall. If I read right, the
> > > > FrenchAnalyzer included in lucene is supposed to be based on
> > > > StandardAnalyzer so I really fail to see what is going wrong.
> > > Might it be
> > > > the fact that the tokenizer used is Stringtokenizer and not
> > > Tokenstream ?
> > > > The numbers are tokenized, and in the returned query they are
> > > present....>
> > > > I really don't know where they get zapped out of existence...
> > > >
> > > > Thanks again for helping.
> > > >
> > > > Xavier Tô
> > > > Bacc. en Informatique et Génie Logiciel
> > > > to.xavier@courrier.uqam.ca
> > > > (450)434-8905
> > > >
> > > >
> > > > --------------------------------------------------------------
> ----
> > > --------------------
> > > >
> > > > Hard to tell without seeing any code.  Perhaps numbers are being
> > > removed> from the query string
> > > > during search.
> > > > Make sure the same or at least "compatible" Analyzer is used
> > > during both
> > > > indexing and querying.
> > > > Grab the code from Lucene in Action .... hm, lucenebook.com may
> > > be down at
> > > > the moment, but
> > > > that's where you can get the code normally.  The code 
> includes some
> > > > classes that let you run
> > > > a query string through a set of Analyzers and see how each of
> > > them behaves
> > > > and what it does
> > > > to a query.
> > > >
> > > > Otis
> > > >
> > > > ----- Original Message ----
> > > > From: "To, Xavier" <Xa...@axa-canada.com>
> > > > To: java-user@lucene.apache.org
> > > > Sent: Wednesday, January 31, 2007 12:21:27 AM
> > > > Subject: Problem with a search engine
> > > >
> > > >
> > > > Hi, I recently started an internship and I've been asked to fix
> > > their> search engine so numbers are searched. At first, I thought
> > > it was the
> > > > Analyzer that wasn't working right, but we're using
> > > StandardAnalyzer and
> > > > the numbers are indexed (I checked with Lukeall). Then I thought
> > > they> are not tokenized during the search, but they are. They just
> > > seem to be
> > > > ignored for some reason. Did anyone experienced something 
> similar> > ? If
> > > > so, how can I fix this ? It's probably something that would jump
> > > in my
> > > > face if it was alive, but I just can't see it. Can anyone 
> help me
> > > ? It
> > > > would be very much appreciated.
> > > >
> > > >
> > > > Xavier T�
> > > > Stagiaire
> > > > D�veloppement - Maintenance & �volution
> > > > AXA Canada Tech
> > > > 2020, rue University, bureau 700
> > > > Montr�al(Qu�bec)H3A 2A5
> > > > T�l. :  (514) 282-6817, poste 2224
> > > > T�l�c. :  (514) 282-6017
> > > > Courriel : Xavier.To@axa-canada.com <Xa...@axa-canada.com>
> > > >   _____
> > > >
> > > > "Ce message est confidentiel, � l'usage exclusif du destinataire
> > > > ci-dessus et son contenu ne repr�sente en aucun cas un 
> engagement> > de la
> > > > part de AXA, sauf en cas de stipulation expresse et par �crit de
> > > la part
> > > > de AXA. Toute publication, utilisation ou diffusion, m�me 
> partielle,> > > doit �tre autoris�e pr�alablement. Si vous n'�tes pas
> > > destinataire de ce
> > > > message, merci d'en avertir imm�diatement l'exp�diteur."
> > > >
> > > > "This e-mail message is confidential, for the exclusive use 
> of the
> > > > addressee and its contents shall not constitute a commitment 
> by AXA,
> > > > except as otherwise specifically provided in writing by AXA. Any
> > > > unauthorized disclosure, use or dissemination, either whole or
> > > partial,> is prohibited. If you are not the intended recipient of
> > > the message,
> > > > please notify the sender immediately."
> > > >
> > > > --------------------------------------------------------------
> ----
> > > ---
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-
> help@lucene.apache.org> > >
> > > >
> > > >
> > > >
> > > > --------------------------------------------------------------
> ----
> > > ---
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-
> help@lucene.apache.org> > >
> > > >
> > >
> >
> >
> > ------------------------------------------------------------------
> ---
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org