Posted to user@lucenenet.apache.org by Luis Fco Ramirez Daza Glez <lu...@gmail.com> on 2010/03/15 03:10:26 UTC

Query question

Hi all

Hope somebody can help me with this.

I have an index with several fields that repeat in a document.

Sample doc:

Date-   2010-01-01
ID-     ASDF
Name-   John Lennon
Name-   Ringo Star
Name-   Paul McCartney
Name-   George Harrison

I want the search to return only the documents that have the terms in the same document field, and NOT merely in the same document.

This query must return the above document:

Name:John + Name:Lennon

But this query should not return the above document:

Name:Paul + Name:Lennon

What I had to do was to create fields with different names:

Name1-  John Lennon
Name2-  Ringo Star
Name3-  Paul McCartney
Name4-  George Harrison

But obviously this is not the cleanest solution, and building the queries is a pain, because they are complex and can grow very fast if the documents have several fields:

(Name1:John + Name1:Lennon) OR (Name2:John + Name2:Lennon) OR (Name3:John + Name3:Lennon) OR (Name4:John + Name4:Lennon)

The queries can become very large: the above query is for indexes that have 4 fields, and some of our indices can have hundreds of the same field.

Any suggestions are welcome.

Also, what is the difference or benefit of having several fields in the same document with the same name if you cannot treat them as if they were different fields? In that case I could just add all my terms to the same “Name” field instead of having several “Name” fields.

Thanks

Luis

RE: Query question

Posted by Michael Garski <mg...@myspace-inc.com>.

Whoops - I misspoke - it's the PositionIncrementAttribute, not the OffsetAttribute that you would use in 2.9.

Michael


RE: Query question

Posted by Digy <di...@gmail.com>.

How about concatenating name and surname (and then lowercasing) with a custom analyzer?
Like: johnlennon ringostar paulmccartney georgeharrison

DIGY
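
The concatenation idea can be tried outside Lucene first. The Python below is only a rough sketch of what such a custom analyzer would emit per name; `fuse_name` is a made-up helper for illustration, not a Lucene.NET API.

```python
def fuse_name(full_name):
    """Lowercase a name and drop its whitespace so it becomes a single term."""
    return "".join(full_name.lower().split())


names = ["John Lennon", "Ringo Star", "Paul McCartney", "George Harrison"]
fused = [fuse_name(name) for name in names]
print(" ".join(fused))
```

The trade-off is that query text has to be fused the same way, so a search for "lennon" on its own would no longer match this field.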


RE: Query question

Posted by Michael Garski <mg...@myspace-inc.com>.

Luis,

As you mentioned previously, rather than adding spacing tokens such as 'xx' between the terms, you can use a custom analyzer to put a large gap in the position increment between the groups of terms you wish to keep together.

In this way you would have:

PositionInc 1   : John 
PositionInc 1   : Lennon
PositionInc 100 : Ringo
PositionInc 1   : Star
PositionInc 100 : Paul
PositionInc 1   : McCartney
PositionInc 100 : George
PositionInc 1   : Harrison

The value of the position increment is relative to the token that comes before it, so a value of 1 indicates it comes directly after the previous token and a value of 100 indicates there are 99 empty positions between it and the previous token.  That way you won't match on the phrase 'lennon ringo' with a slop value of less than 99.  The downside is that a term-based Boolean query without spans would still get a hit on a query such as +field:john +field:paul.

It's fairly simple to do using TokenAttributes in 2.9 through the OffsetAttribute, or with previous versions by calling SetPositionIncrement on the token that is created in a tokenizer or comes through a TokenFilter.

FYI - this is the approach Solr takes with what is called 'multi-valued fields'.

Michael
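
The position arithmetic in this message can be checked with a small toy model. The Python below is only a sketch and does not call Lucene: `positions` and `min_slop` are made-up names, absolute positions are assumed to be the running sum of the increments listed above, and each term is assumed to occur once.

```python
from itertools import accumulate

# (increment, term) pairs as listed in the message above.
TOKENS = [
    (1, "john"), (1, "lennon"),
    (100, "ringo"), (1, "star"),
    (100, "paul"), (1, "mccartney"),
    (100, "george"), (1, "harrison"),
]


def positions(tokens):
    """Map each term to its absolute position (running sum of increments)."""
    absolute = accumulate(inc for inc, _ in tokens)
    return {term: pos for (_, term), pos in zip(tokens, absolute)}


def min_slop(first, second, pos):
    """Smallest slop at which the in-order phrase 'first second' matches.

    An exact phrase expects `second` one position after `first`;
    every extra position in between costs one unit of slop.
    """
    distance = pos[second] - pos[first]
    if distance < 1:
        raise ValueError("terms are not in phrase order")
    return distance - 1


POS = positions(TOKENS)
print(min_slop("john", "lennon", POS))   # terms of the same name
print(min_slop("lennon", "ringo", POS))  # terms of neighbouring names
```

Under this model the within-name phrase matches with no slop at all, while the cross-name phrase needs a slop of 99, so any slop comfortably below the gap keeps the names apart.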



RE: Query question

Posted by Luis Fco Ramirez Daza Glez <lu...@gmail.com>.

Hi Art

Thanks, that is a solution, and is easy to implement.
But unfortunately we are currently trying to optimize the performance of our indexing process and the performance of our searches, and that will add some garbage to our indices.

I'll try to set the gap for each field in an Analyzer and only use your suggestion as my last resort.

Best regards
Luis



Re: Query question

Posted by Artem Chereisky <a....@gmail.com>.

Luis,

what you can do is inject some dummy tokens between your "Names". The number
of tokens should be greater than the slop. Here's what it would look like:

John Lennon xx Ringo Star xx Paul McCartney xx George Harrison

then, if your slop is 0, Lennon Ringo will not find anything. If your slop
is 1, you'll need two dummy tokens (John Lennon xx xx Ringo Star) etc...

Art



RE: Query question

Posted by Luis Fco Ramirez Daza Glez <lu...@gmail.com>.

Thanks Art

Actually, I'm trying to do that with SpanQuery.
It seems that I need to override getPositionIncrementGap in my analyzer so that the queries do not match across the boundaries of different "Name" fields, but I've found it difficult to do and to find an example.

Regards,
Luis

> -----Original Message-----
> From: Artem Chereisky [mailto:a.chereisky@gmail.com]
> Sent: Sunday, March 14, 2010 10:13 PM
> To: lucene-net-user@lucene.apache.org
> Subject: Re: Query question
> 
> Luis,
> 
> Have you considered PhraseQuery or SpanQuery? Both of them would satisfy
> your requirement of finding "John Lennon" but not "Paul Lennon".
> 
> Answering your last question about fields with the same name, they are not
> really different fields. Consider this:
> 
> Adding three fields to the same document
> 
> var fld1 = new Field(someName, "one" .... );
> var fld2 = new Field(someName, "two" .... );
> var fld3 = new Field(someName, "three" .... );
> doc.Add(fld1);
> doc.Add(fld2);
> doc.Add(fld3);
> 
> is very similar to
> 
> var fld1 = new Field(someName, "one two three" .... ), when using
> WhitespaceAnalyzer.
> 
> Have you seen Luke (http://www.getopt.org/luke/)? It's a good way to
> understand what's happening inside your index.
> 
> Regards,
> Art

Re: Query question

Posted by Artem Chereisky <a....@gmail.com>.

Luis,

Have you considered PhraseQuery or SpanQuery? Both of them would satisfy
your requirement of finding "John Lennon" but not "Paul Lennon".

Answering your last question about fields with the same name, they are not
really different fields. Consider this:

Adding three fields to the same document

var fld1 = new Field(someName, "one" .... );
var fld2 = new Field(someName, "two" .... );
var fld3 = new Field(someName, "three" .... );
doc.Add(fld1);
doc.Add(fld2);
doc.Add(fld3);

is very similar to

var fld1 = new Field(someName, "one two three" .... ), when using
WhitespaceAnalyzer.
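
The equivalence above comes from how token positions are assigned. The following is a simplified, library-free model of that behaviour, not Lucene.Net code: the names PositionGapDemo, Positions and Near are made up for illustration, and the gap and slop values are assumptions. It shows that with a gap of 0 the values of a repeated field sit at consecutive positions, so a proximity match can cross from one value into the next, while a gap larger than the slop prevents that.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PositionGapDemo
{
    // Simplified model: tokens of each value get consecutive positions, and
    // `gap` extra positions are skipped between one value and the next.
    public static List<KeyValuePair<string, int>> Positions(string[] values, int gap)
    {
        var result = new List<KeyValuePair<string, int>>();
        int next = 0;
        foreach (string value in values)
        {
            string[] tokens = value.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string token in tokens)
            {
                result.Add(new KeyValuePair<string, int>(token, next));
                next++;
            }
            next += gap;
        }
        return result;
    }

    // True when the two terms occur with at most `slop` positions between
    // them, in either order (roughly what an unordered proximity query checks).
    public static bool Near(List<KeyValuePair<string, int>> positions, string a, string b, int slop)
    {
        return positions.Where(p => p.Key == a).Any(pa =>
            positions.Where(p => p.Key == b).Any(pb =>
                pa.Value != pb.Value && Math.Abs(pa.Value - pb.Value) - 1 <= slop));
    }

    public static void Main()
    {
        var names = new[] { "John Lennon", "Ringo Star", "Paul McCartney", "George Harrison" };
        var noGap = Positions(names, 0);
        var withGap = Positions(names, 100);

        Console.WriteLine(Near(noGap, "John", "Lennon", 5));   // True
        Console.WriteLine(Near(noGap, "Paul", "Lennon", 5));   // True: crosses values
        Console.WriteLine(Near(withGap, "John", "Lennon", 5)); // True
        Console.WriteLine(Near(withGap, "Paul", "Lennon", 5)); // False
    }
}
```

In this model, Positions(new[] { "one", "two", "three" }, 0) and Positions(new[] { "one two three" }, 0) produce the same list, which is the sense in which the three separate fields are "very similar to" the single one.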

Have you seen Luke (http://www.getopt.org/luke/)? It's a good way to
understand what's happening inside your index.

Regards,
Art

