Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2004/03/31 16:49:19 UTC

Weird Search Behavior

I'm experiencing some very puzzling search behavior.  I am using the CVS head I pulled about a week ago.  I use the StandardAnalyzer and QueryParser.  I have a collection of XML documents indexed.  One field is "subhead", and here's what I find with different queries:
subhead:(missile defense)    - works fine 
subhead("missile" "defense") - works fine
subhead("missile defense") - fails
subhead(missile defense "missile defense") - fails
subhead(missile defense "missile dork") - works fine
subhead(missile defense "missile defens") - works fine (note misspelling)

At the moment, I can't find any other field or phrase that does this.  However, according to my notes (as I'm no longer trusting my mind on this), about a week ago (about the time I started using the new CVS version) I noticed similar behavior with the query 'subhead:"al qaeda"', but that now works perfectly fine! The same thing happened with the query 'summary:"heart disease"': it failed to work and then, a day or so later, it worked.  (I merge new documents into the master index each day.)

Any ideas on what might possibly be going on would be very much appreciated.

Regards,

Terry
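
A note on the queries above: the grouped form subhead:(missile defense) and the quoted single-term form match on individual terms, while the quoted phrase additionally requires the terms to sit at adjacent positions in the field. The sketch below is a toy model of that difference in plain Java. It is not Lucene code; the class and method names are invented for illustration, the tokenizer is a crude stand-in for an analyzer, and it assumes the query parser's default OR operator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model only: illustrates the matching rules, not Lucene's implementation.
class PhraseVsTermsSketch {

    // Lower-case and split on non-letters, a rough stand-in for an analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("[^a-z]+"));
    }

    // Term (OR) query: true if any query term occurs anywhere in the field.
    static boolean matchesAnyTerm(String field, String... terms) {
        List<String> tokens = tokenize(field);
        for (String term : terms) {
            if (tokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Phrase query: true only if the terms occur at consecutive positions.
    static boolean matchesPhrase(String field, String... terms) {
        List<String> tokens = tokenize(field);
        for (int start = 0; start + terms.length <= tokens.size(); start++) {
            boolean all = true;
            for (int i = 0; i < terms.length; i++) {
                if (!tokens.get(start + i).equals(terms[i])) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String subhead = "Missile tests delay defense plan";
        System.out.println(matchesAnyTerm(subhead, "missile", "defense")); // true
        System.out.println(matchesPhrase(subhead, "missile", "defense"));  // false
        System.out.println(matchesPhrase("New missile defense plan", "missile", "defense")); // true
    }
}
```

In this model a phrase match depends on term positions as well as term presence, so a fault that affects only position data would leave the term queries working while the phrase query fails. Whether that is what is happening here is not established by the thread.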


Re: Weird Search Behavior

Posted by Doug Cutting <cu...@apache.org>.
Terry,

Can you please try to develop a reproducible test case?  Otherwise it's 
impossible to verify and debug this.

For something like this it would suffice to provide:

   1. The initial index, which satisfies the test queries;

   2. The new index you add;

   3. Your merge and test code, as a single class that illustrates the 
problem.

The smaller the indexes the better: not only will it be easier to 
transfer them, but debugging will be faster.

Also, you should add a bug to track this, at:

   http://issues.apache.org/bugzilla/enter_bug.cgi?product=Lucene

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Weird Search Behavior

Posted by Terry Steichen <te...@net-frame.com>.
I did some more checking and uncovered what appears to be a serious Lucene
problem. (Either that or my merge code - below - is wrong.)  Appreciate any
help in figuring out what's wrong.  Here are the facts as I see them:

1) I put together a large number of canned queries (some rather complex) for
routine testing purposes.
2) I created a new compound file index and tested the queries.  All worked
fine.
3) I then indexed some new documents and merged the new index with the
original index.
4) I then tried the queries again.  Each time I did this, about 1-3% of the
queries no longer worked - the actual number appears to vary with each
merge.
5) The specific queries that fail change with each merge. Ones that failed
after the previous merge almost always appear to work again with the next
merge (which produces a new batch of failures).
6) In all cases I've so far examined, the offending part of the affected
queries is a single quoted phrase (even though there may be several such
phrases in the query) - remove it, and the (now modified) query works fine.
7) I tried the same thing using the original multi-file index format, with
the same results.
8) About a week and a half ago, I migrated from 1.3final to the latest CVS
head.
9) I've only just started checking this, so I don't know how long this
behavior has been going on.  The small percentage of errors and (apparent)
randomness of which query is affected make it hard to detect.
10) I have about 32 fields per document, most of which are tokenized,
indexed and stored.
11) My merge code (for the multi-file index format) is this:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

class MergeIndices {
  public static void main(String[] args) {

    // args[0]: relative path to main index
    // args[1]: relative path to new index (to be merged with main)

    try {
      IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), false);
      // writer.setUseCompoundFile(true); // used for compound format
      FSDirectory dir = FSDirectory.getDirectory(args[1], false);
      FSDirectory[] dirs = new FSDirectory[1];
      dirs[0] = dir;
      writer.addIndexes(dirs);
      writer.optimize();
      writer.close();
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
        "\n with message: " + e.getMessage());
    }
  }

}
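
The check described in steps 1 to 5 above can be sketched as a before/after comparison of the canned queries. This is a minimal sketch in plain, modern Java, not the code used in the thread: the class name, the method names and the hitCounter parameter are hypothetical, with hitCounter standing in for whatever runs a query against the index and returns its hit count. Flagging any drop in hit count assumes the merge only adds documents and that nothing is deleted, which the thread does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical harness: compares canned-query hit counts across a merge.
class QueryRegressionSketch {

    // Run every canned query and record its hit count, preserving query order.
    static Map<String, Integer> snapshot(List<String> queries, ToIntFunction<String> hitCounter) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String query : queries) {
            counts.put(query, hitCounter.applyAsInt(query));
        }
        return counts;
    }

    // Queries whose hit count dropped after the merge (assumes a merge only adds documents).
    static List<String> regressions(Map<String, Integer> before, Map<String, Integer> after) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : before.entrySet()) {
            int now = after.getOrDefault(entry.getKey(), 0);
            if (now < entry.getValue()) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList(
            "subhead:\"missile defense\"",
            "subhead:(missile defense)");

        // Fake hit counts standing in for searches against the real indexes.
        Map<String, Integer> fakeBefore = new HashMap<>();
        fakeBefore.put(queries.get(0), 3);
        fakeBefore.put(queries.get(1), 5);
        Map<String, Integer> fakeAfter = new HashMap<>();
        fakeAfter.put(queries.get(0), 0);
        fakeAfter.put(queries.get(1), 6);

        Map<String, Integer> before = snapshot(queries, q -> fakeBefore.getOrDefault(q, 0));
        Map<String, Integer> after = snapshot(queries, q -> fakeAfter.getOrDefault(q, 0));
        System.out.println(regressions(before, after)); // [subhead:"missile defense"]
    }
}
```

Saving the list of regressed queries after each merge would also make it easier to confirm the observation in step 5 that the failing set changes from one merge to the next.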





Re: Weird Search Behavior

Posted by Terry Steichen <te...@net-frame.com>.
No, they're typos in the e-mail.  In the application, all the colons are
properly placed.  (Guess I was/am so frustrated I can't write right any
more).

Terry



Re: Weird Search Behavior

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 31, 2004, at 9:49 AM, Terry Steichen wrote:
> I'm experiencing some very puzzling search behavior.  I am using the 
> CVS head I pulled about a week ago.  I use the StandardAnalyzer and 
> QueryParser.  I have a collection of XML documents indexed.  One field 
> is "subhead", and here's what I find with different queries:
> subhead:(missile defense)    - works fine
> subhead("missile" "defense") - works fine
> subhead("missile defense") - fails
> subhead(missile defense "missile defense") - fails
> subhead(missile defense "missile dork") - works fine
> subhead(missile defense "missile defens") - works fine (note 
> misspelling)

I presume the missing colons on all but the first example is just a 
typo in your e-mail?  If not, might that be the problem?

	Erik
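
The missing-colon slip raised above is easy to screen for mechanically before a query reaches the parser. The following is a rough heuristic in plain Java, not part of Lucene: the class and method names are invented for illustration, and the rule (a word immediately followed by an opening parenthesis) is an assumption about what such a typo looks like, so it may flag queries that are intentionally written that way.

```java
import java.util.regex.Pattern;

// Hypothetical pre-check: flags "field(" where "field:(" was probably intended.
class FieldColonCheck {

    // A word immediately followed by "(" looks like a field name that lost its colon.
    private static final Pattern WORD_THEN_PAREN =
        Pattern.compile("\\b[A-Za-z_][A-Za-z0-9_]*\\(");

    static boolean looksLikeMissingColon(String query) {
        return WORD_THEN_PAREN.matcher(query).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeMissingColon("subhead(\"missile defense\")")); // true
        System.out.println(looksLikeMissingColon("subhead:(missile defense)"));    // false
    }
}
```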

