Posted to java-user@lucene.apache.org by Martin Malmsten-2 <ma...@libris.kb.se> on 2005/09/05 00:24:26 UTC

SAME-opattor (possible newbie question)

Is there a way to tell Lucene to restrict proximity searches to just one field? This would mimic the BRS/Search SAME-operator, which I use very often.

For example, given this data:

author: a b c
author: d e f

a search for "a SAME c" would match the first row, but "a SAME d" would match nothing, which is what I want.

Is this possible?

Cheers,
  martin


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SAME-opattor (possible newbie question)

Posted by Chris Hostetter <ho...@fucit.org>.

: > : For example, given this data:
: > :
: > : author: a b c
: > : author: d e f

: > : a search for "a SAME c" would match the first row, but "a SAME d"
: > : would match nothing, which is what I want.

: No, both fields are in the same document. Which is also why proximity
: does not work.

: Or is there some way of telling a proximity query to not cross field
: boundaries?

you have to be careful about your terms.  In Lucene, there isn't really
any notion of a "field boundary" unless you are talking about two fields
with different names.  If you create a document and add two
(indexed and tokenized) Field objects with the same field name, they are
treated the same as if they had been concatenated together (see the
javadocs for Document.add).
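
A rough illustration of that concatenation, in plain Java rather than the Lucene API: the class and method names are made up for this sketch, and it assumes simple whitespace tokenization with no position gap between values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; it does not call any Lucene API.
final class SameNameFieldSketch {
    // Tokenize each value on whitespace and append the tokens to one list,
    // so a token's list index plays the role of its position in the field.
    static List<String> concatenate(List<String> values) {
        List<String> tokens = new ArrayList<>();
        for (String value : values) {
            tokens.addAll(Arrays.asList(value.trim().split("\\s+")));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> author = concatenate(Arrays.asList("a b c", "d e f"));
        // "c" ends the first value and "d" starts the second, yet they sit
        // at neighbouring positions, so a proximity match can span both.
        System.out.println(author.indexOf("c")); // 2
        System.out.println(author.indexOf("d")); // 3
    }
}
```

Under those assumptions "c" and "d" end up one position apart, which is why a proximity query for "a SAME d" style matching cannot tell the two values apart.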

The good news is: as long as you've got some practical limits on the size
of your field values, you should be able to use a custom
Analyzer/TokenFilter to get the behavior you want -- by creating a "magic
token" to separate your individual values, and using a TokenFilter that
throws away these magic tokens when it sees them, but artificially bumps
up the positionIncrement for the next token it gets by some really large
amount -- so that Phrase/SpanNear queries with a slop less than that
amount will never cross your "boundary".

for example: if your source data has the following values for author...

   1) Napolean
   2) Terrence "The Man With Two Dynamite Brains" Winchester
   3) Hoss Man

... add that field as a single string value...

   Napolean ~AUTHOR~ Terrence "The Man With Two
   Dynamite Brains" Winchester ~AUTHOR~ Hoss Man

...and use an analyzer/tokenfilter that creates the following
token/position pairs...

   Napolean(0) Terrence(1000) The(1001) Man(1002) With(1003) Two(1004)
   Dynamite(1005) Brains(1006) Winchester(1007) Hoss(2007) Man(2008)

now as long as you use a slop less than 1000 (but big enough to cover the
words in between, e.g. for "Terrence Winchester"), searches for
author:"Hoss Man" and author:"Terrence Winchester" will return this
document, but a search for author:"Napolean Dynamite" will fail.
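
The position arithmetic above can be checked with a small simulation. This is a sketch in plain Java, not a real Lucene TokenFilter: the class name, the `positions` and `near` helpers, and the whitespace tokenization are all assumptions made for illustration, and `near` is only a simplified stand-in for an ordered proximity check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the position-gap idea; no Lucene classes are used.
final class PositionGapSketch {
    static final String MAGIC = "~AUTHOR~"; // separator token from the example
    static final int GAP = 1000;            // increment applied after a separator

    // Assign a position to every token, dropping separators and bumping the
    // increment of the token that follows one.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        int position = -1;
        int increment = 1;
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            if (raw.equals(MAGIC)) {
                increment = GAP; // remember the gap, emit nothing
                continue;
            }
            String term = raw.replace("\"", ""); // strip quote characters
            position += increment;
            increment = 1;
            index.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
        }
        return index;
    }

    // True if `first` is followed by `second` with at most `slop` other
    // positions in between (simplified; not Lucene's actual slop logic).
    static boolean near(Map<String, List<Integer>> index,
                        String first, String second, int slop) {
        List<Integer> none = Collections.emptyList();
        for (int p1 : index.getOrDefault(first, none)) {
            for (int p2 : index.getOrDefault(second, none)) {
                int between = p2 - p1 - 1;
                if (between >= 0 && between <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String author = "Napolean ~AUTHOR~ Terrence \"The Man With Two "
                + "Dynamite Brains\" Winchester ~AUTHOR~ Hoss Man";
        Map<String, List<Integer>> index = positions(author);
        System.out.println(near(index, "Hoss", "Man", 0));             // true
        System.out.println(near(index, "Terrence", "Winchester", 10)); // true
        System.out.println(near(index, "Napolean", "Dynamite", 999));  // false
    }
}
```

In a real index the same bookkeeping would live in a custom TokenFilter that sets the position increment on the token following each separator; the simulation only shows why a slop below the gap cannot bridge two values.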



LIA (Lucene in Action) has good info on writing your own analyzer/tokenfilter.



-Hoss



Re: SAME-opattor (possible newbie question)

Posted by Martin Malmsten <Ma...@kb.se>.

> : For example, given this data:
> :
> : author: a b c
> : author: d e f
> :
> : a search for "a SAME c" would match the first row, but "a SAME d"
> : would match nothing, which is what I want.
>
> if i understand you correctly, then you are describing a use case in which
> the index has two documents, each containing a single field named "author"
> one of which contains the tokens "a", "b", and "c" the other containing
> "d", "e", and "f"

No, both fields are in the same document. Which is also why proximity
does not work.

To give you some context: I am experimenting with replacing BRS/Search
with Lucene in one of our bibliographic databases (approx. 20M
bibliographic records, in MARC21). Bibliographic data is notoriously
tricky since it tends to contain lots of repeating fields and the
fields themselves often contain lots of information between the data
you want to match against.

Or is there some way of telling a proximity query to not cross field  
boundaries?

And no, I have no idea what an opattor is either ... ;)

martin


Re: SAME-opattor (possible newbie question)

Posted by Chris Hostetter <ho...@fucit.org>.

: For example, given this data:
:
: author: a b c
: author: d e f
:
: a search for "a SAME c" would match the first row, but "a SAME d" would
: match nothing, which is what I want.

if i understand you correctly, then you are describing a use case in which
the index has two documents, each containing a single field named "author"
one of which contains the tokens "a", "b", and "c" the other containing
"d", "e", and "f"

In that case, a Lucene query for "+author:a +author:c" would return the
first document, but a query for "+author:a +author:d" would return no
results.
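
To make the two-document case concrete, here is a plain-Java sketch of what "every required term must be in the same document" means. It does not use Lucene's query classes; the class and method names are invented for the example, and each document is reduced to a whitespace-tokenized author string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of required ("+") terms; no Lucene API involved.
final class RequiredTermsSketch {
    // Return the indexes of the documents that contain every required term.
    static List<Integer> matchAll(List<String> docs, String... required) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> tokens =
                    new HashSet<>(Arrays.asList(docs.get(i).trim().split("\\s+")));
            if (tokens.containsAll(Arrays.asList(required))) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b c", "d e f");
        System.out.println(matchAll(docs, "a", "c")); // [0]
        System.out.println(matchAll(docs, "a", "d")); // []
    }
}
```

This only holds when each author value is its own document; with both values in one document, both queries would match.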

You should take a look at the QueryParser syntax documentation to get an
idea of the way simple searching works -- but please keep in mind that
this document only explains the types of queries that can be done using the
very basic Parser provided by default with Lucene -- the scope of searches
you can execute if you programmatically generate Query objects is much
larger...

	http://lucene.apache.org/java/docs/queryparsersyntax.html



-Hoss

