Posted to general@lucene.apache.org by Deepa Paranjpe <de...@yahoo-inc.com> on 2007/02/13 21:11:23 UTC

Problems with "AND" queries

I have small documents indexed. 
When I query the index using a BooleanQuery containing {why,is,the,sky,blue}
with all queries having the MUST BooleanClause, I do not retrieve any
results.
However, when I use only { why,sky,blue} I get results which are 
Why is the sky blue? And several of them.

What is going wrong? Please help. 


-----Original Message-----
From: Stefan Groschupf [mailto:sg@101tec.com] 
Sent: Monday, November 06, 2006 5:18 AM
To: general@lucene.apache.org
Subject: Re: [PROPOSAL] index server project

Hi,

Do people think we are already at a stage where we can set up some
basic infrastructure, like a mailing list and wiki, and move the
discussion to the new mailing list? Maybe set up an incubator project?

I would be happy to help with such basic tasks.

Stefan



Am 31.10.2006 um 22:03 schrieb Yonik Seeley:

> On 10/30/06, Doug Cutting <cu...@apache.org> wrote:
>> Yonik Seeley wrote:
>> > On 10/18/06, Doug Cutting <cu...@apache.org> wrote:
>> >> We assume that, within an index, a file with a given name is written
>> >> only once.
>> >
>> > Is this necessary, and will we need the lockless patch (that avoids
>> > renaming or rewriting *any* files), or is Lucene's current index
>> > behavior sufficient?
>>
>> It's not strictly required, but it would make index synchronization a
>> lot simpler. Yes, I was assuming the lockless patch would be committed
>> to Lucene before this project gets very far.  Something more than that
>> would be required in order to keep old versions, but this could be as
>> simple as a Directory subclass that refuses to remove files for a time.
>
> Or a snapshot (hard links) mechanism.
> Lucene would also need a way to open a specific index version (rather
> than just the latest), but I guess that could also be hacked into
> Directory by hiding later "segments" files (assumes lockless is
> committed).
>
>> > It's unfortunate the master needs to be involved on every document add.
>>
>> That should not normally be the case.
>
> Ahh... I had assumed that "id" in the following method was document id:
>  IndexLocation getUpdateableIndex(String id);
>
> I see now it's index id.
>
> But what is index id exactly?  Looking at the example API you laid
> down, it must be a single physical index (as opposed to a logical
> index).  In which case, is it entirely up to the client to manage
> multi-shard indices?  For example, if we had a "photo" index broken
> up into 3 shards, each shard would have a separate index id and it
> would be up to the client to know this, and to query across the
> different "photo0", "photo1", "photo2" indices.  The master would
> have no clue those indices were related.  Hmmm, that doesn't work
> very well for deletes though.
>
> It seems like there should be the concept of a logical index, that is
> composed of multiple shards, and each shard has multiple copies.
>
> Or were you thinking that a cluster would only contain a single
> logical index, and hence all different index ids are simply different
> shards of that single logical index?  That would seem to be consistent
> with ClientToMasterProtocol.getSearchableIndexes() lacking an id
> argument.
>
>> I was not imagining a real-time system, where the next query after a
>> document is added would always include that document.  Is that a
>> requirement?  That's harder.
>
> Not real-time, but it would be nice if we kept it close to what Lucene
> can currently provide.
> Most people seem fine with a latency of minutes.
>
>> At this point I'm mostly trying to see if this functionality would meet
>> the needs of Solr, Nutch and others.
>>
>
> It depends on the project scope and how extensible things are.
> It seems like the master would be a WAR, capable of running standalone.
> What about index servers (slaves)?  Would this project include just
> the interfaces to be implemented by Solr/Nutch nodes, some common
> implementation code behind the interfaces in the form of a library, or
> also complete standalone WARs?
>
> I'd need to be able to extend the ClientToSlave protocol to add
> additional methods for Solr (for passing in extra parameters and
> returning various extra data such as facets, highlighting, etc).
>
>> Must we include a notion of document identity and/or document version in
>> the mechanism? Would that facilitate updates and coherency?
>
> It doesn't need to be in the interfaces I don't think, so it depends
> on the scope of the index server implementations.
>
> -Yonik
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: Problems with "AND" queries

Posted by Chris Hostetter <ho...@fucit.org>.
first off: please don't just pick an arbitrary email message to reply to
and change the subject: it makes the list archives very confusing.

second: if you have additional questions about using lucene, you should
try asking the user list for the specific port you are using -- i'm
guessing you are using the java APIs, so that would be the
java-user@lucene mailing list -- those lists tend to have more people
reading them than general (which is for talking about the Lucene family of
projects, or for asking where you should ask a particular type of
question).

on to your question...

: When I query the index using a BooleanQuery containing {why,is,the,sky,blue}
: with all queries having the MUST BooleanClause, I do not retrieve any
: results.
: However, when I use only { why,sky,blue} I get results which are
: Why is the sky blue? And several of them.

more than likely, when you indexed your documents you used an analyzer
which treats "is" and "the" as stop words and stripped them out.

if you used the QueryParser to generate a query for your list of words, it
would do the same thing (provided you told it the correct analyzer) -- but
if you construct your TermQuery and BooleanQuery objects
directly you have to do this manually as well (just as you will need to
lowercase your terms if you used an analyzer that lowercases when
indexing).
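
to make the effect concrete, here is a rough sketch in plain java. it does
not use any Lucene classes: the stop word list, the StopWordDemo class and
its normalize/index/matchAll methods are all made up for illustration, and
a real analyzer may use a different stop list. it only shows why an
all-MUST query built from raw words can match nothing, while the same
words passed through the index-time normalization do match.

```java
import java.util.*;

// Illustration only: a toy inverted index, not Lucene's implementation.
class StopWordDemo {
    // Hypothetical stop list; the analyzer you actually used may differ.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("is", "the", "a", "an", "and", "of"));

    // Lowercase each word and drop stop words, as an analyzer might at index time.
    static List<String> normalize(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            String t = w.toLowerCase(Locale.ROOT);
            if (t.isEmpty() || STOP_WORDS.contains(t)) continue;
            out.add(t);
        }
        return out;
    }

    // Build term -> document ids, storing only normalized terms.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            List<String> raw = Arrays.asList(docs.get(id).split("\\W+"));
            for (String term : normalize(raw)) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Every term is required (like MUST clauses): intersect the posting lists.
    static Set<Integer> matchAll(Map<String, Set<Integer>> postings, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }
}
```

with a document "Why is the sky blue?" in the index, matchAll on the raw
words {why,is,the,sky,blue} finds nothing because "is" and "the" were never
stored, while matchAll on normalize(...) of the same words finds it. with
Lucene itself the equivalent step is to run your query text through the
same analyzer you indexed with, which is what QueryParser does for you.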




-Hoss