Posted to java-user@lucene.apache.org by Mek <me...@gmail.com> on 2006/10/16 07:59:49 UTC

constructing smaller phrase queries given a multi-word query

Has anyone dealt with the problem of constructing sub-queries given a
multi-word query ?

Here is an example to illustrate what I mean:

user queries for -> A B C D
right now I rewrite that query as "A B C D" A B C D to give phrase
matches a higher weight.

What might happen, though, is that the user is looking for a document
with "A B" in Field1 and "C D" in Field2.

So I should ideally be constructing the query as :

"A B C D"^20 "A B"^10 "C D"^10 "B C D"^15  "A B C"^15 A B C D

Has anyone solved this problem, or are there other ways to handle it?
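
A minimal sketch of how the boosted query string above could be generated, in plain Java with no Lucene dependency. The class name SubPhraseQueryBuilder and the boost scheme (5 per word, which happens to reproduce the ^20/^15/^10 values in the example) are assumptions made for illustration. Unlike the hand-written example, it emits every contiguous sub-phrase of two or more words, so "B C"^10 is included as well, and it does not escape query parser special characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: expands a multi-word query into boosted sub-phrases. */
final class SubPhraseQueryBuilder {

    /** Assumed boost scheme: 5 per word, so 4 words -> ^20, 3 -> ^15, 2 -> ^10. */
    static final int BOOST_PER_WORD = 5;

    static String build(String userQuery) {
        String[] words = userQuery.trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        // Longest phrases first; stop at two words, because single
        // words are appended unboosted after the loop.
        for (int len = words.length; len >= 2; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                String phrase = String.join(" ",
                        Arrays.copyOfRange(words, start, start + len));
                clauses.add("\"" + phrase + "\"^" + (len * BOOST_PER_WORD));
            }
        }
        // Plain terms, so documents without any phrase match can still score.
        clauses.addAll(Arrays.asList(words));
        return String.join(" ", clauses);
    }
}
```

An n-word query yields n(n-1)/2 phrase clauses, so the expansion stays small for typical user queries but grows quadratically with query length.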


Thanks,
mek.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: constructing smaller phrase queries given a multi-word query

Posted by Chris Hostetter <ho...@fucit.org>.

: eg. "rowling goblet of fire" - need to match rowling in 1 field &
: "goblet of fire" in another
: "hilary duff most wanted" - need to match "hilary duff" in 1 field &
: "most wanted" in another

: >  Why not just index those separate fields into the yet a third field and
: > search there?
: >
: > Or why not just put it all into one field in the first place?
:
: I need the ability to boost matches in certain fields higher than others.
: So both the above approaches do not work for me.
:
: Not all fields have the same analyzer, so that's another reason for not
: using 1 catch-all field.

Your "catch-all" field can have yet another analyzer, typically a
simpler analyzer than the field-specific ones. The important thing is
that the user input is parsed using the appropriate analyzer for each
field.

: I am trying out DisjunctionMaxQueries and will soon move to it.
: I first want a phrase match to be done & if that fails then a non-phrase match.
: My problem is that I can't easily decide which phrases to build given a
: 4+ word query from the user.

The initial approach you described, building all of the possible
permutations, is the best approach I can think of, using DisMax to group
things in such a way that no one term/phrase common to multiple fields
dominates the score. You'll also probably want to play with
BooleanQuery.setMinimumNumberShouldMatch and with disabling coord.
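
To illustrate why DisMax grouping keeps a term that is common in several fields from dominating, here is a small plain-Java model of the scoring rule documented for DisjunctionMaxQuery: the maximum sub-query score plus a tie-breaker multiplier times the remaining sub-query scores. This is not Lucene code; the class name DisMaxScoreSketch and the per-field scores are made up for the example, and scores are assumed to be non-negative.

```java
/** Hypothetical model of DisMax-style score combination (not Lucene code). */
final class DisMaxScoreSketch {

    /**
     * Combines per-field scores for one term or phrase.
     * tieBreaker = 0.0 keeps only the best field;
     * tieBreaker = 1.0 degenerates to a plain sum over fields.
     */
    static double disMax(double[] fieldScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        // Best field counts fully; the other fields are scaled down.
        return max + tieBreaker * (sum - max);
    }
}
```

With made-up field scores of 2.0, 1.0 and 0.5, a tie-breaker of 0.0 gives 2.0, whereas a plain sum would give 3.5. A phrase that matches in three fields is therefore not rewarded three times over.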

Personally, I've never really tried tackling what you describe, because
queries are typically either simple enough that using the input as a
single phrase across multiple fields works, or complicated enough that
you aren't going to find good "sub-phrase" matches in a general way.
Relying on matching the individual terms usually covers those cases well
enough (because you've got enough terms to get a meaningful score).

The one exception that I've been considering pursuing for one project is
to recognize *really* common phrases that should be thought of as a
single term (e.g. "digital camera"), but in most cases they can probably
be dealt with in the analyzers, using synonyms that collapse the multiple
tokens down to a single marker token (e.g. digital_camera).
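
The collapsing idea can be sketched outside of any analyzer as a plain token rewrite. The code below is only an illustration under stated assumptions: the class name PhraseCollapser is hypothetical, the list of common phrases is supplied by the caller, and only two-word phrases are handled. In a real index the same rewrite would have to be applied at both index time and query time, for example inside a custom token filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: joins known two-word phrases into one marker token. */
final class PhraseCollapser {

    /**
     * Scans left to right; whenever two adjacent tokens form a known
     * phrase ("digital camera"), emits one marker token ("digital_camera").
     */
    static List<String> collapse(List<String> tokens, Set<String> commonPhrases) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext && commonPhrases.contains(tokens.get(i) + " " + tokens.get(i + 1))) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
                i += 2; // both tokens consumed by the marker
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

For the tokens cheap, digital, camera, bag and the single known phrase "digital camera", this produces cheap, digital_camera, bag.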



-Hoss



Re: constructing smaller phrase queries given a multi-word query

Posted by Mekin Maheshwari <me...@gmail.com>.

On 10/19/06, Erick Erickson <er...@gmail.com> wrote:
> What is the use case you're trying to solve? It doesn't make sense to me
> that you want to take a query from a user and split it over fields under the
> covers.

Well I am planning on doing exactly that, given that we have seen some
amount of user queries which need to match across fields.

eg. "rowling goblet of fire" - need to match rowling in 1 field &
"goblet of fire" in another
"hilary duff most wanted" - need to match "hilary duff" in 1 field &
"most wanted" in another


>
>  Why not just index those separate fields into the yet a third field and
> search there?
>
> Or why not just put it all into one field in the first place?

I need the ability to boost matches in certain fields higher than others.
So both the above approaches do not work for me.

Not all fields have the same analyzer, so that's another reason for not
using 1 catch-all field.

I am trying out DisjunctionMaxQueries and will soon move to it.
I first want a phrase match to be done & if that fails then a non-phrase match.
My problem is that I can't easily decide which phrases to build given a
4+ word query from the user.

Thanks Erick for the response.
I hope I have done a better job of explaining my problem.

-mek



>
> The more details of what you're trying to do you provide, the better answers
> you'll get <G>..




>
> Best
> Erick
>
>
>
> On 10/18/06, Mekin Maheshwari <me...@gmail.com> wrote:
> >
> > Resending, with the hope that the search gurus missed this.
> >
> > Would really appreciate any advice on this.
> > Would not want to reinvent the wheel & I am sure this is something
> > that would have been done.
> >
> > Thanks,
> > mek
> >
> > On 10/16/06, Mek <me...@gmail.com> wrote:
> > > Has anyone dealt with the problem of constructing sub-queries given a
> > > multi-word query ?
> > >
> > > Here is an example to illustrate what I mean:
> > >
> > > user queries for -> A B C D
> > > right now I change that query to "A B C D" A B C D to give phrase
> > > matches higher weightage.
> > >
> > > What might happen though, is that the user is looking for a document
> > > where "A B" in Field1 & "C D"  in Field2.
> > >
> > > So I should ideally be constructing the query as :
> > >
> > > "A B C D"^20 "A B"^10 "C D"^10 "B C D"^15  "A B C"^15 A B C D
> > >
> > > Has someone solved this problem, are there other ways to handle this
> > problem ?
> > >
> > >
> > > Thanks,
> > > mek.
> > >
> >
> >
> > --
> > http://mekin.livejournal.com/
> >
> >
> >
>
>


-- 
http://mekin.livejournal.com/


Re: constructing smaller phrase queries given a multi-word query

Posted by Erick Erickson <er...@gmail.com>.

What is the use case you're trying to solve? It doesn't make sense to me
that you want to take a query from a user and split it over fields under the
covers.

 Why not just index those separate fields into the yet a third field and
search there?

Or why not just put it all into one field in the first place?

The more details of what you're trying to do you provide, the better answers
you'll get <G>..

Best
Erick



On 10/18/06, Mekin Maheshwari <me...@gmail.com> wrote:
>
> Resending, with the hope that the search gurus missed this.
>
> Would really appreciate any advice on this.
> Would not want to reinvent the wheel & I am sure this is something
> that would have been done.
>
> Thanks,
> mek
>
> On 10/16/06, Mek <me...@gmail.com> wrote:
> > Has anyone dealt with the problem of constructing sub-queries given a
> > multi-word query ?
> >
> > Here is an example to illustrate what I mean:
> >
> > user queries for -> A B C D
> > right now I change that query to "A B C D" A B C D to give phrase
> > matches higher weightage.
> >
> > What might happen though, is that the user is looking for a document
> > where "A B" in Field1 & "C D"  in Field2.
> >
> > So I should ideally be constructing the query as :
> >
> > "A B C D"^20 "A B"^10 "C D"^10 "B C D"^15  "A B C"^15 A B C D
> >
> > Has someone solved this problem, are there other ways to handle this
> problem ?
> >
> >
> > Thanks,
> > mek.
> >
>
>
> --
> http://mekin.livejournal.com/
>
>
>

RE: NativeFSLockFactory problem

Posted by Frank Kunemann <fr...@innosystec.de>.

Hi Mike,

No problem. Just good to know it's not my fault this time... ;)


Regards,
Frank

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Thursday, October 19, 2006 12:03 PM
To: java-user@lucene.apache.org
Subject: Re: NativeFSLockFactory problem

Frank Kunemann wrote:
>  
> Hi all,
> 
> I'm trying to use the new class NativeFSLockFactory, but as you can 
> guess I have a problem using it.
> Don't know what I'm doing wrong, so here is the code:

There is a serious bug with NativeFSLockFactory as it now stands -- it's
precisely the issue you've come across: different directories end up
incorrectly sharing the same lock file.  I am working on the fix and will
re-submit the patch soon.  Sorry about this.

Mike


Re: NativeFSLockFactory problem

Posted by Michael McCandless <lu...@mikemccandless.com>.

Frank Kunemann wrote:
>  
> Hi all,
> 
> I'm trying to use the new class NativeFSLockFactory, but as you can guess I
> have a problem using it.
> Don't know what I'm doing wrong, so here is the code:

There is a serious bug with NativeFSLockFactory as it now stands --
it's precisely the issue you've come across: different directories end
up incorrectly sharing the same lock file.  I am working on the fix
and will re-submit the patch soon.  Sorry about this.
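
As an illustration of the property that is broken here (each index directory needs its own lock file name), the sketch below derives a lock name prefix from the directory path. It is not the actual Lucene implementation and not the pending fix. The class name LockNameSketch is hypothetical, and the use of an MD5 hex digest of the canonical path is an assumption, suggested only by the "lucene-" plus 32 hex digits names in the log output.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: one lock name prefix per index directory. */
final class LockNameSketch {

    /** Returns "lucene-" followed by 32 hex digits derived from the path. */
    static String lockPrefix(String canonicalIndexPath) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    canonicalIndexPath.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros so the name length is constant.
            return "lucene-" + String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

If lock files for several indices live in one shared directory, as with the Temp directory in the log, two different index paths must map to two different prefixes, and the same path must always map to the same prefix.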

Mike


NativeFSLockFactory problem

Posted by Frank Kunemann <fr...@innosystec.de>.

 
Hi all,

I'm trying to use the new class NativeFSLockFactory, but as you can guess I
have a problem using it.
Don't know what I'm doing wrong, so here is the code:


FSDirectory dir = FSDirectory.getDirectory(indexDir, create,
        NativeFSLockFactory.getLockFactory());
logger.info("Index: " + indexDir.getAbsolutePath()
        + " Lock file: " + dir.getLockID());
this.writer = new IndexWriter(dir, new StandardAnalyzer(), create);


Just to explain: indexDir is a File and create is set to false. The 2nd
line is only there to show what is going on.


My problem is that there are many indices (just 4 of them for testing
purposes). The first one starts and works as it should, but from the 2nd
on I get "Lock obtain timed out" exceptions.
This is the log output:

08:38:05,199 INFO  [IndexerManager] No indexer found for directory D:\[mydir]\index1- starting new one!
08:38:05,199 INFO  [Indexer] Index: D:\[mydir]\index1 Lock file: lucene-0ca7838f9396a636d1feda5aabb9b8db
08:38:05,215 INFO  [IndexerManager] New amount of Indexers: 1
08:38:05,215 INFO  [IndexerManager] No indexer found for directory D:\[mydir]\index2- starting new one!
08:38:05,215 INFO  [Indexer] Index: D:\[mydir]\index2 Lock file: lucene-cc9dfaabbf7ad61c4bb3af007b88288c
08:38:06,213 ERROR [IndexerManager] Lock obtain timed out: NativeFSLock@C:\Dokumente und Einstellungen\[user]\Lokale Einstellungen\Temp\lucene-fd415060ae453638d69faa9fa07fbe95-n-write.lock
java.io.IOException: Lock obtain timed out: NativeFSLock@C:\Dokumente und Einstellungen\[user]\Lokale Einstellungen\Temp\lucene-fd415060ae453638d69faa9fa07fbe95-n-write.lock
	at org.apache.lucene.store.Lock.obtain(Lock.java:68)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:257)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:247)
	at de.innosystec.iar.indexing.Indexer.setUp(Indexer.java:101)
	at de.innosystec.iar.indexing.Indexer.<init>(Indexer.java:80)
	at de.innosystec.iar.indexing.IndexerManager.addDocumentElement(IndexerManager.java:228)
	at de.innosystec.iar.parsing.ParserManager.indexDocumentElement(ParserManager.java:286)
	at de.innosystec.iar.parsing.ParserThread.startWorking(ParserThread.java:378)
	at de.innosystec.iar.parsing.ParserThread.run(ParserThread.java:175)
	at java.lang.Thread.run(Unknown Source)


The lock file mentioned in the exception is really created and used by the
first index. Seems like the FSDirectory.getLockID method doesn't work for
NativeFSLockFactory?
I'm using Win XP on my test platform.


Regards,
Frank



Re: constructing smaller phrase queries given a multi-word query

Posted by Mekin Maheshwari <me...@gmail.com>.

Resending, with the hope that the search gurus missed this.

Would really appreciate any advice on this.
Would not want to reinvent the wheel & I am sure this is something
that would have been done.

Thanks,
mek

On 10/16/06, Mek <me...@gmail.com> wrote:
> Has anyone dealt with the problem of constructing sub-queries given a
> multi-word query ?
>
> Here is an example to illustrate what I mean:
>
> user queries for -> A B C D
> right now I change that query to "A B C D" A B C D to give phrase
> matches higher weightage.
>
> What might happen though, is that the user is looking for a document
> where "A B" in Field1 & "C D"  in Field2.
>
> So I should ideally be constructing the query as :
>
> "A B C D"^20 "A B"^10 "C D"^10 "B C D"^15  "A B C"^15 A B C D
>
> Has someone solved this problem, are there other ways to handle this problem ?
>
>
> Thanks,
> mek.
>


-- 
http://mekin.livejournal.com/
