Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2007/02/21 12:11:28 UTC

Positions in SpanFirst

Hi,

I have a field to which I add several bits of information, e.g.

doc.add(new Field("x", "first bit"));
doc.add(new Field("x", "second part"));
doc.add(new Field("x", "third section"));

I am using SpanFirstQuery to search them with something like:

while...
   SpanTermQuery stquery = new SpanTermQuery(new Term("x", termStr[incFactor]));
   query = new SpanFirstQuery(stquery, incFactor);
   incFactor++;

but a search for

"first", span pos 1
"bit", span pos 2

gets a match, but

"second", span pos 1
"part", span pos 2

fails.  How can I get the first term position for each word in each Field added
to the document for the same field name to be 1, so that the SpanFirst works?

Antony






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Positions in SpanFirst

Posted by Chris Hostetter <ho...@fucit.org>.

: so I thought that sounded good, but there does not seem to be a way to set it
: and most of the Analyzers just seem to use the base Analyzer method which
: returns 0, so I'm now confused as to what this actually does in practice.

by default all the analyzers return 0, but you can subclass any analyzer
(or anonymously subclass it) or write your own wrapper analyzer that
implements it to return whatever you want.
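
A minimal sketch of the anonymous-subclass idea. The nested Analyzer class here is only a stand-in so the snippet compiles without Lucene on the classpath; it mirrors the getPositionIncrementGap(String fieldName) method of the real Analyzer, and GapAnalyzerSketch, withGap and the value 1000 are made up for illustration.

```java
public class GapAnalyzerSketch {
    // Stand-in for Lucene's Analyzer, reduced to the one method discussed here.
    // Assumption: the real class has the same method name and a default of 0.
    public static class Analyzer {
        public int getPositionIncrementGap(String fieldName) {
            return 0; // default: no extra gap between values of a field
        }
    }

    // Hypothetical helper: anonymous subclass that returns a gap for one field only.
    public static Analyzer withGap(final String gapField, final int gap) {
        return new Analyzer() {
            @Override
            public int getPositionIncrementGap(String fieldName) {
                return gapField.equals(fieldName)
                        ? gap
                        : super.getPositionIncrementGap(fieldName);
            }
        };
    }

    public static void main(String[] args) {
        Analyzer analyzer = withGap("x", 1000);
        System.out.println(analyzer.getPositionIncrementGap("x"));
        System.out.println(analyzer.getPositionIncrementGap("y"));
    }
}
```

With the real library the same override would go on whichever analyzer is passed to the IndexWriter.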

: That's a good point.  The field is used to index mail recipients and currently
: has a "starts with" search (non Lucene implementation).  Unless I can set the
: position increment gap, it is only ever possible to search for the first indexed
: recipient with proximity queries.

search the archives for info on "boundary" searching ... sentence or
paragraph or page ... the basic mechanism that can be used to get "starts
with" type queries where your definition of starts with doesn't match up
with the actual first term in the field is to put a marker token in at
the beginning of each value and then do a SpanNear query that starts with
it.
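
A rough sketch of the marker-token idea in plain Java, with no Lucene calls. It only models the token stream: the marker string "_start_", the class name and both method names are assumptions for illustration. In Lucene the check would presumably be an in-order SpanNear of the marker term followed by the name terms with no slop.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerTokenSketch {
    // Hypothetical marker; it must be a token the analyzer keeps and real data never contains.
    static final String MARKER = "_start_";

    // Flatten all values of one field into a single token stream,
    // putting MARKER in front of the tokens of each value.
    public static List<String> tokensWithMarkers(String... values) {
        List<String> out = new ArrayList<>();
        for (String value : values) {
            out.add(MARKER);
            for (String tok : value.trim().toLowerCase().split("\\s+")) {
                out.add(tok);
            }
        }
        return out;
    }

    // True if MARKER is followed directly by the given terms, in order,
    // which means some value starts with those terms.
    public static boolean valueStartsWith(List<String> stream, String... terms) {
        for (int i = 0; i + terms.length < stream.size(); i++) {
            if (!stream.get(i).equals(MARKER)) {
                continue;
            }
            boolean matches = true;
            for (int j = 0; j < terms.length; j++) {
                if (!stream.get(i + 1 + j).equals(terms[j])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stream = tokensWithMarkers("first bit", "second part", "third section");
        System.out.println(valueStartsWith(stream, "second")); // true
        System.out.println(valueStartsWith(stream, "part"));   // false
    }
}
```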




-Hoss



Re: Positions in SpanFirst

Posted by Antony Bowesman <ad...@teamware.com>.

Ahh, now it falls into place.
Thanks
Antony


Chris Hostetter wrote:
> it's not called Analyzer.getPositionAfterGap .. it's
> Analyzer.getPositionIncrementGap .. it's the Position Increment used when
> there is a Gap -- so returning 0 means that no extra increment is used, and
> multiple values are treated just like one long stream of tokens (each
> token has a position of 1 greater than the token before it).




Re: Positions in SpanFirst

Posted by Chris Hostetter <ho...@fucit.org>.

: So, if you can add 1000, shouldn't setting 0 each time cause it to start at 0
: each time?  The default Analyzer.getPositionIncrementGap always returns 0.

it's not called Analyzer.getPositionAfterGap .. it's
Analyzer.getPositionIncrementGap .. it's the Position Increment used when
there is a Gap -- so returning 0 means that no extra increment is used, and
multiple values are treated just like one long stream of tokens (each
token has a position of 1 greater than the token before it).
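
To make that concrete, here is a small plain-Java model of the position bookkeeping, not Lucene code. It assumes the first token sits at position 0, that a single term at position p occupies the span [p, p+1), and that SpanFirst accepts a span whose end is at most the given end. GapZeroSketch and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GapZeroSketch {
    // Each token is one past the previous token; `gap` is added once between
    // successive values of the same field. Assumes every token is unique.
    public static Map<String, Integer> positions(int gap, String... values) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap;
            }
            for (String tok : values[v].trim().split("\\s+")) {
                pos.put(tok, next++);
            }
        }
        return pos;
    }

    // Model of SpanFirst over a single term: the term's span ends at p + 1.
    public static boolean spanFirst(Map<String, Integer> pos, String term, int end) {
        Integer p = pos.get(term);
        return p != null && p + 1 <= end;
    }

    public static void main(String[] args) {
        Map<String, Integer> pos = positions(0, "first bit", "second part", "third section");
        System.out.println(pos); // {first=0, bit=1, second=2, part=3, third=4, section=5}
        System.out.println(spanFirst(pos, "first", 1));  // true
        System.out.println(spanFirst(pos, "second", 1)); // false
    }
}
```

With a gap of 0 the second value simply continues at position 2, which is why "second" never satisfies a SpanFirst with end 1.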




-Hoss



Re: Positions in SpanFirst

Posted by Antony Bowesman <ad...@teamware.com>.

Chris Hostetter wrote:
> : So I don't see why using a SpanNear that respects order and a large
> : IncrementGap won't solve your problem...... Although it would return "odd"
> 
> i think the use case he's worried about is that he needs to be able to
> find matches just on the "start" of a person's name, ie...
> 
> 	Email#1 To: Jim Bob; Sue Anne-Marie Brown; John Doe
> 	Email#2 To: Tom Smith; Bob Jones; John Doe
> 
> ...he needs to support existing semantics that let him say "find emails
> where the start of a person's name is 'bob'" and have it return Email#2, but
> not Email#1 .. hence his interest in SpanFirst -- he wants to match the
> "first" few tokens of a "value" added to a field (which isn't what
> SpanFirst does)

Correct.  That's one of the current search mechanisms.  It's not a major issue 
and I think I'll probably end up ducking it on the basis that the system 
directory defaults to a surname/firstname name order, but of course there's no 
guarantee that mail from other systems will have those names in that order, e.g.

#1 To: Bowesman Antony
#2 To: Antony Bowesman

makes this 'starts with' feature less useful.

Thanks again Hoss and Erick for the suggestions!  This list is excellent and I 
do wonder if you guys actually have day jobs that you are able to do as well as 
give the amount of time you seem to on this list!

Antony


Re: Positions in SpanFirst

Posted by Chris Hostetter <ho...@fucit.org>.

: So I don't see why using a SpanNear that respects order and a large
: IncrementGap won't solve your problem...... Although it would return "odd"

i think the use case he's worried about is that he needs to be able to
find matches just on the "start" of a person's name, ie...

	Email#1 To: Jim Bob; Sue Anne-Marie Brown; John Doe
	Email#2 To: Tom Smith; Bob Jones; John Doe

...he needs to support existing semantics that let him say "find emails
where the start of a person's name is 'bob'" and have it return Email#2, but
not Email#1 .. hence his interest in SpanFirst -- he wants to match the
"first" few tokens of a "value" added to a field (which isn't what
SpanFirst does)


-Hoss



Re: Positions in SpanFirst

Posted by Erick Erickson <er...@gmail.com>.

I really think you need to stop obsessing on SpanFirst <G>. I suspect that
this is leading you down an unrewarding path.

So I don't see why using a SpanNear that respects order and a large
IncrementGap won't solve your problem...... Although it would return "odd"
matches. Let's say you indexed "first second third" as one name and then
searched on a SpanNear of second and third with a slop of 100. You'd get a
match on a middle and last name rather than a first and last name..... But I
wonder if this can be tolerated given all the new capabilities you'll
doubtless be adding <G>.

Best
Erick

On 2/21/07, Antony Bowesman <ad...@teamware.com> wrote:
>
> Hi Erick,
>
> > What this does is allow you to put gaps between successive sets of terms
> > indexed in the same field. For instance...
> > doc.add("field", "some stuff");
> > doc.add("field", "bunch hooey");
> > doc.add("field", "what is this");
> > writer.add(doc);
> >
> > In this case, there would be the following positions, assuming that the
> > IncrementGap was 1000....
> > some 0
> > stuff 1
> > bunch 1002
> > hooey 1003
> > what 2004
> > is 2005
> > this 2006
>
> So, if you can add 1000, shouldn't setting 0 each time cause it to start
> at 0
> each time?  The default Analyzer.getPositionIncrementGap always returns 0.
>
> >> That's a good point.  The field is used to index mail recipients and
> >> currently
> >> has a "starts with" search (non Lucene implementation).  Unless I can
> set
> >> the
> >> position increment gap, it is only ever possible to search for the
> first
> >> indexed
> >> recipient with proximity queries.
> >
> >
> > This is confusing me. You can easily use proximity queries with the
> above
> > scenario. For instance, searching for "bunch hooey"~4 would generate a
> hit.
> > As would "bunch hooey"~10000. But "some this"~10 would not generate a
> hit.
> > Whether that does what you need is another question <G>... So it's time
> to
> > ask "what are you really trying to do?" In other words, what behavior
> are
> > you trying to mimic from the old code? It's not clear to me what the
> > behavior you need is. It'd help if you gave a concrete example of the
> raw
> > data, and what you want returned...
>
> Your example is good enough, just assume they are people's names :)  I know
> I had
> a mail from Mrs Bunch Ogilvy, so I want to do a "starts with", i.e.
> SpanFirst
> for bunch, so I find all the first name bunches.
>
> > In your first example, using the above scheme, you'd get hits (using
> > SpanNear rather than SpanFirst) if you searched on
> > "first bit" in a SpanNear query with a slop of 2. You'd also get a hit
> if
> > you searched on
> > "second part" in a SpanNear with a slop of 2. Does this mimic the
> behavior
> > you need?
>
> No, SpanNear is fine, but SpanFirst will not work as there always has to
> be a
> starting offset.  I can't search "bunch hooey" as SpanFirst unless I know
> that
> it was indexed as the second 'group' and therefore set the starting span
> position as 1002.
>
> Using Lucene has added a whole world of new search possibilities to the
> product,
> but when people have been using something a certain way for 15 years, it
> can be
> difficult to shift their expectations :)  There's always someone who will
> shout...
>
> Antony
>
>

Re: Positions in SpanFirst

Posted by Antony Bowesman <ad...@teamware.com>.

Hi Erick,

> What this does is allow you to put gaps between successive sets of terms
> indexed in the same field. For instance...
> doc.add("field", "some stuff");
> doc.add("field", "bunch hooey");
> doc.add("field", "what is this");
> writer.add(doc);
> 
> In this case, there would be the following positions, assuming that the
> IncrementGap was 1000....
> some 0
> stuff 1
> bunch 1002
> hooey 1003
> what 2004
> is 2005
> this 2006

So, if you can add 1000, shouldn't setting 0 each time cause it to start at 0 
each time?  The default Analyzer.getPositionIncrementGap always returns 0.

>> That's a good point.  The field is used to index mail recipients and
>> currently
>> has a "starts with" search (non Lucene implementation).  Unless I can set
>> the
>> position increment gap, it is only ever possible to search for the first
>> indexed
>> recipient with proximity queries.
> 
> 
> This is confusing me. You can easily use proximity queries with the above
> scenario. For instance, searching for "bunch hooey"~4 would generate a hit.
> As would "bunch hooey"~10000. But "some this"~10 would not generate a hit.
> Whether that does what you need is another question <G>... So it's time to
> ask "what are you really trying to do?" In other words, what behavior are
> you trying to mimic from the old code? It's not clear to me what the
> behavior you need is. It'd help if you gave a concrete example of the raw
> data, and what you want returned...

Your example is good enough, just assume they are people's names :)  I know I had 
a mail from Mrs Bunch Ogilvy, so I want to do a "starts with", i.e. SpanFirst 
for bunch, so I find all the first name bunches.

> In your first example, using the above scheme, you'd get hits (using
> SpanNear rather than SpanFirst) if you searched on
> "first bit" in a SpanNear query with a slop of 2. You'd also get a hit if
> you searched on
> "second part" in a SpanNear with a slop of 2. Does this mimic the behavior
> you need?

No, SpanNear is fine, but SpanFirst will not work as there always has to be a 
starting offset.  I can't search "bunch hooey" as SpanFirst unless I know that 
it was indexed as the second 'group' and therefore set the starting span 
position as 1002.

Using Lucene has added a whole world of new search possibilities to the product, 
but when people have been using something a certain way for 15 years, it can be 
difficult to shift their expectations :)  There's always someone who will shout...

Antony



Re: Positions in SpanFirst

Posted by Erick Erickson <er...@gmail.com>.

See below..

On 2/21/07, Antony Bowesman <ad...@teamware.com> wrote:
>
> Hi Erick,
>
> > I'm not sure you can, since all the interfaces I use alter the increment
> > between successive terms, but I'll be the first to admit that there are
> > many
> > nooks and crannies that I don't know about... But I suspect that a
> negative
> > increment is not supported intentionally....
>
> I read your other interesting post about omitting termvector info and this
> led
> me to find Analyzer.getPositionIncrementGap.  The javadocs state
>
> "Invoked before indexing a Field instance if terms have already been added
> to
> that field..."
>
> so I thought that sounded good, but there does not seem to be a way to set
> it
> and most of the Analyzers just seem to use the base Analyzer method which
> returns 0, so I'm now confused as to what this actually does in practice.

What this does is allow you to put gaps between successive sets of terms
indexed in the same field. For instance...
doc.add("field", "some stuff");
doc.add("field", "bunch hooey");
doc.add("field", "what is this");
writer.add(doc);

In this case, there would be the following positions, assuming that the
IncrementGap was 1000....
some 0
stuff 1
bunch 1002
hooey 1003
what 2004
is 2005
this 2006

It was a little hard to get my head around. The purpose is to be able to
add several values to a single field in a document, but keep some sense of
grouping between them.
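
The position list above can be reproduced with a few lines of plain Java. This is only a model of the bookkeeping as described in this message (first token at 0, gap added once between values); exact numbers may differ between Lucene versions, and PositionGapSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapSketch {
    // Returns "token position" lines for several values added to the same field.
    public static List<String> positions(int gap, String... values) {
        List<String> out = new ArrayList<>();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap; // the increment gap is applied between values only
            }
            for (String tok : values[v].trim().split("\\s+")) {
                out.add(tok + " " + next);
                next++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : positions(1000, "some stuff", "bunch hooey", "what is this")) {
            System.out.println(line);
        }
    }
}
```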

> > But I really doubt you want to do this due to the consequences. Consider
> in
> > your example the terms would have the following offsets
> > first 0
> > bit 1
> > second 0
> > part 1
> > third 0
> > section 1
> >
> > Now think about a proximity query "first section"~1. This would produce
> a
> > hit because you've changed the whole sense of what offsets mean. Is this
> > really a good thing?
>
> That's a good point.  The field is used to index mail recipients and
> currently
> has a "starts with" search (non Lucene implementation).  Unless I can set
> the
> position increment gap, it is only ever possible to search for the first
> indexed
> recipient with proximity queries.

This is confusing me. You can easily use proximity queries with the above
scenario. For instance, searching for "bunch hooey"~4 would generate a hit.
As would "bunch hooey"~10000. But "some this"~10 would not generate a hit.
Whether that does what you need is another question <G>... So it's time to
ask "what are you really trying to do?" In other words, what behavior are
you trying to mimic from the old code? It's not clear to me what the
behavior you need is. It'd help if you gave a concrete example of the raw
data, and what you want returned...
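
A simplified model of why those three queries behave that way, using the positions listed earlier in this message. It treats slop as the number of positions allowed between two in-order terms, which is only an approximation of how Lucene evaluates sloppy phrases; ProximitySketch and withinSlop are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

public class ProximitySketch {
    // Positions copied from the example above (IncrementGap of 1000).
    public static Map<String, Integer> examplePositions() {
        Map<String, Integer> pos = new HashMap<>();
        pos.put("some", 0);
        pos.put("stuff", 1);
        pos.put("bunch", 1002);
        pos.put("hooey", 1003);
        pos.put("what", 2004);
        pos.put("is", 2005);
        pos.put("this", 2006);
        return pos;
    }

    // True if `second` follows `first` with at most `slop` positions in between.
    public static boolean withinSlop(Map<String, Integer> pos, String first, String second, int slop) {
        Integer a = pos.get(first);
        Integer b = pos.get(second);
        if (a == null || b == null) {
            return false;
        }
        int between = b - a - 1;
        return between >= 0 && between <= slop;
    }

    public static void main(String[] args) {
        Map<String, Integer> pos = examplePositions();
        System.out.println(withinSlop(pos, "bunch", "hooey", 4));     // true
        System.out.println(withinSlop(pos, "bunch", "hooey", 10000)); // true
        System.out.println(withinSlop(pos, "some", "this", 10));      // false
    }
}
```

The gap is what keeps "some this" from matching: the two terms are 2005 positions apart, far more than any small slop.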

In your first example, using the above scheme, you'd get hits (using
SpanNear rather than SpanFirst) if you searched on
"first bit" in a SpanNear query with a slop of 2. You'd also get a hit if
you searched on
"second part" in a SpanNear with a slop of 2. Does this mimic the behavior
you need?

NOTE: my "first bit" with slop shorthand above would actually be
constructed by instantiating a SpanNear query with two SpanTermQuerys in the
constructor....

Best
Erick

> I'm trying to ensure the Lucene implementation provides at least the
> original
> functionality.  If I can't achieve it I can just document the
> limitation.  If I
> can, I may get false hits, but I still have the choice to filter the hits
> and
> weed out the false ones before being given to the client.  It's not a
> showstopper, but it would be good if it could be done.
>
> Thanks
> Antony
>
>
>

Re: Positions in SpanFirst

Posted by Antony Bowesman <ad...@teamware.com>.

Hi Erick,

> I'm not sure you can, since all the interfaces I use alter the increment
> between successive terms, but I'll be the first to admit that there are 
> many
> nooks and crannies that I don't know about... But I suspect that a negative
> increment is not supported intentionally....

I read your other interesting post about omitting termvector info and this led 
me to find Analyzer.getPositionIncrementGap.  The javadocs state

"Invoked before indexing a Field instance if terms have already been added to 
that field..."

so I thought that sounded good, but there does not seem to be a way to set it 
and most of the Analyzers just seem to use the base Analyzer method which 
returns 0, so I'm now confused as to what this actually does in practice.

> But I really doubt you want to do this due to the consequences. Consider in
> your example the terms would have the following offsets
> first 0
> bit 1
> second 0
> part 1
> third 0
> section 1
> 
> Now think about a proximity query "first section"~1. This would produce a
> hit because you've changed the whole sense of what offsets mean. Is this
> really a good thing?

That's a good point.  The field is used to index mail recipients and currently 
has a "starts with" search (non Lucene implementation).  Unless I can set the 
position increment gap, it is only ever possible to search for the first indexed 
recipient with proximity queries.

I'm trying to ensure the Lucene implementation provides at least the original 
functionality.  If I can't achieve it I can just document the limitation.  If I 
can, I may get false hits, but I still have the choice to filter the hits and 
weed out the false ones before being given to the client.  It's not a 
showstopper, but it would be good if it could be done.
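
A sketch of what that post-filtering step might look like, in plain Java with no Lucene calls. It assumes the recipient values of each candidate hit are available as strings (for example from a stored field); StartsWithFilterSketch, its methods and the sample recipients are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StartsWithFilterSketch {
    // True if at least one value has `term` as its first token.
    public static boolean anyValueStartsWith(List<String> values, String term) {
        String wanted = term.toLowerCase();
        for (String value : values) {
            String[] toks = value.trim().toLowerCase().split("\\s+");
            if (toks.length > 0 && toks[0].equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // Keep only the candidate hits where some recipient value starts with `term`.
    public static List<String> filter(Map<String, List<String>> candidates, String term) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : candidates.entrySet()) {
            if (anyValueStartsWith(e.getValue(), term)) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> candidates = new LinkedHashMap<>();
        candidates.put("Email#1", Arrays.asList("Jim Bob", "Sue Anne-Marie Brown", "John Doe"));
        candidates.put("Email#2", Arrays.asList("Tom Smith", "Bob Jones", "John Doe"));
        System.out.println(filter(candidates, "bob")); // [Email#2]
    }
}
```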

Thanks
Antony




Re: Positions in SpanFirst

Posted by Erick Erickson <er...@gmail.com>.

I'm not sure you can, since all the interfaces I use alter the increment
between successive terms, but I'll be the first to admit that there are many
nooks and crannies that I don't know about... But I suspect that a negative
increment is not supported intentionally....

But I really doubt you want to do this due to the consequences. Consider in
your example the terms would have the following offsets
first 0
bit 1
second 0
part 1
third 0
section 1

Now think about a proximity query "first section"~1. This would produce a
hit because you've changed the whole sense of what offsets mean. Is this
really a good thing?
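
The false hit can be seen in a small plain-Java model that compares the two numbering schemes. Nothing here calls Lucene; the slop check is the same simplification of "positions allowed between two in-order terms", and ResetVsGapSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

public class ResetVsGapSketch {
    // If restartEachValue is true, every value starts again at position 0;
    // otherwise positions continue and `gap` is added between values.
    // Assumes every token is unique.
    public static Map<String, Integer> positions(boolean restartEachValue, int gap, String... values) {
        Map<String, Integer> pos = new HashMap<>();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (restartEachValue) {
                next = 0;
            } else if (v > 0) {
                next += gap;
            }
            for (String tok : values[v].trim().split("\\s+")) {
                pos.put(tok, next++);
            }
        }
        return pos;
    }

    // True if `second` follows `first` with at most `slop` positions in between.
    public static boolean near(Map<String, Integer> pos, String first, String second, int slop) {
        int between = pos.get(second) - pos.get(first) - 1;
        return between >= 0 && between <= slop;
    }

    public static void main(String[] args) {
        String[] values = {"first bit", "second part", "third section"};
        Map<String, Integer> reset = positions(true, 0, values);
        Map<String, Integer> gapped = positions(false, 1000, values);
        System.out.println(near(reset, "first", "section", 1));  // true: a false hit across values
        System.out.println(near(gapped, "first", "section", 1)); // false
    }
}
```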

I suspect that the guys who really know things about the internals could
provide some good suggestions if you gave them a better idea of what it is
you're trying to accomplish and why you think SpanFirst helps accomplish
that....

Best
Erick

On 2/21/07, Antony Bowesman <ad...@teamware.com> wrote:
>
> Hi,
>
> I have a field to which I add several bits of information, e.g.
>
> doc.add(new Field("x", "first bit"));
> doc.add(new Field("x", "second part"));
> doc.add(new Field("x", "third section"));
>
> I am using SpanFirstQuery to search them with something like:
>
> while...
>    SpanTermQuery stquery = new SpanTermQuery(new Term("x",
> termStr[incFactor]));
>    query = new SpanFirstQuery(stquery, incFactor);
>    incFactor++;
>
> but a search for
>
> "first", span pos 1
> "bit", span pos 2
>
> gets a match, but
>
> "second", span pos 1
> "part", span pos 2
>
> fails.  How can I get the first term position for each word in each Field
> added
> to the document for the same field name to be 1, so that the SpanFirst
> works?
>
> Antony
>
>
>
>
>
>