Posted to dev@lucene.apache.org by Claudio Corsi <cl...@gmail.com> on 2008/06/06 16:23:15 UTC

Fwd: SpanNearQuery: how to get the "intra-span" matching positions?

Hi,
I'm trying to extend the NearSpansOrdered and NearSpansUnordered classes of
the Lucene core so that I can access the inner positions of the current span
(in a next() loop). Suppose the current near span starts at position N and
ends at position N+k: I would like to discover the starting/ending positions
of all the inner clauses that generate that span.

I'm working on the NearSpansOrdered class right now. I guess this
modification could be trivial, but it takes me time to understand the
code. Any hints?

For now (as a very inefficient way to proceed) I've added this method, to be
called *after each next()*, but it doesn't work as expected:

public Spans[] matchingSpans() {

      ArrayList<Spans> list = new ArrayList<Spans>();
      if (subSpans.length == 0) return null;
      for(int pos = 0; pos < subSpans.length; pos++) {
          if (subSpans[pos].doc() != matchDoc) continue;
          if (subSpans[pos].start() >= matchStart
                  && subSpans[pos].end() <= matchEnd)
              list.add(subSpans[pos]);
      }
      return list.toArray(new Spans[0]);
}

As you can see, I'm just looping over the subSpans array, keeping the ones
that have doc() == matchDoc and whose span starts and ends inside the current
near span (matchStart and matchEnd are the boundaries returned by start() and
end() of NearSpansOrdered). This technique doesn't work. Maybe the problem
is that the subSpans are not in the right state after the next() call?

Thank you for any hints!


---------- Forwarded message ----------
From: Paul Elschot <pa...@xs4all.nl>
Date: Fri, May 30, 2008 at 8:51 PM
Subject: Re: SpanNearQuery: how to get the "intra-span" matching positions?
To: java-user@lucene.apache.org


On Friday 30 May 2008 12:10, Claudio Corsi wrote:
> Hi all,
> I'm querying my index with a SpanNearQuery built on top of some
> SpanOrQuery. Now, the Spans object I get from the SpanNearQuery
> instance returns the sequence of text spans, each defined by
> its starting/ending positions. I'm wondering if there is a simple
> way to get not only the start/end positions of the entire span, but
> also the individual matching positions inside that span.  For example,
> suppose that a SpanNearQuery composed of 3 SpanTermQuery
> (with a slop of K) produces as its resulting span the term sequence: <t0
> t1 t2 t3 .... t100> (so start() == 0, end() == 100). I know for
> sure that t0 and t100 have generated a match, since the span is "minimal"
> (right?).

Right. But make sure to test, some less than straightforward situations
are possible when matching spans. For example, the subqueries may
be SpanNearQuery's themselves instead of SpanTermQuery's.

> But I also know that there is a third match somewhere in the
> span (I have 3 SpanTermQuery that have to match). Is there a way to
> discover it?

To get this information, you'll have to extend NearSpansOrdered and
NearSpansUnordered (package private classes in o.a.l.search.spans)
to also provide for example an int[] with the actual
matching 'positions', or subspans each with their own begin and end.
This is fairly straightforward, but to actually use such positions
SpanScorer will also need to be extended or even replaced.

In case you want to continue this discussion, please do so
on java-dev.

Regards,
Paul Elschot.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




-- 
Claudio Corsi

Re: Fwd: SpanNearQuery: how to get the "intra-span" matching positions?

Posted by Claudio Corsi <cl...@gmail.com>.
Hi!
I've implemented an inefficient (but working) solution to the "intra-span"
matching problem. With these modifications (see the attachment) I have a way
to pick all the matching positions *inside* the current NearSpan using the
new method matchingSpans() (to be called after each next()).

There are three files:

1) NearSpans.java (just an interface declaring the matchingSpans method, I
need it for my framework, but it is not mandatory)
2) NearSpansUnordered.java
3) NearSpansOrdered.java

(the package declaration is specific to my project, please ignore it!! ;)

Files 2) and 3) are copies of the ones I found in Lucene 2.3.1.
NearSpansUnordered just adds the implementation of matchingSpans() without
any other modifications to the code: it cycles over the list of
SpanCell kept inside the current instance and filters out the elements whose
doc() is not equal to the current doc() of the Spans, or whose
start()/end() matching positions are not "compatible" with those of the
current Spans state.
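
The filtering step described above can be sketched without Lucene. This is only an illustration under assumptions: SubSpanState and MatchFilter are hypothetical names, not the classes in the attachment, and the state objects merely stand in for the doc()/start()/end() values held by each SpanCell.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the doc()/start()/end() state of one sub-span. */
final class SubSpanState {
    final int doc;
    final int start;
    final int end;

    SubSpanState(int doc, int start, int end) {
        this.doc = doc;
        this.start = start;
        this.end = end;
    }
}

final class MatchFilter {
    private MatchFilter() {}

    /**
     * Keeps the states that are on the matching document and whose
     * start/end fall within the boundaries of the current near span.
     */
    static List<SubSpanState> inside(List<SubSpanState> states,
                                     int matchDoc, int matchStart, int matchEnd) {
        List<SubSpanState> result = new ArrayList<SubSpanState>();
        for (SubSpanState s : states) {
            if (s.doc != matchDoc) {
                continue; // positioned on another document
            }
            if (s.start >= matchStart && s.end <= matchEnd) {
                result.add(s);
            }
        }
        return result;
    }
}
```

The filter is only meaningful if the states passed in were recorded while each sub-span was still at its matching position.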

The case of NearSpansOrdered is a little more complicated. I had to
keep track of the subSpans states in the method
shrinkToAfterShortestMatch(). So I've introduced the subSpansCopy ArrayList
and the inner class SpansCopy, which just copies the doc()/start()/end()
values of the span passed to it. I then use this list in the implementation of
matchingSpans() in a way similar to NearSpansUnordered.
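
The copying idea can be illustrated without Lucene. In this sketch, PositionCursor is a hypothetical stand-in for a live sub-span that moves forward on each next(), and SpanSnapshot plays the role described for SpansCopy: it records doc()/start()/end() at the moment it is created. Neither is the code from the attachment, and the exclusive end position is an assumption.

```java
/** Hypothetical forward-only cursor over single-term positions in one document. */
final class PositionCursor {
    private final int doc;
    private final int[] positions;
    private int index = -1;

    PositionCursor(int doc, int... positions) {
        this.doc = doc;
        this.positions = positions;
    }

    /** Moves to the next position; returns false once the cursor is exhausted. */
    boolean next() {
        index++;
        return index < positions.length;
    }

    int doc() { return doc; }

    int start() { return positions[index]; }

    /** Assumption: the end position is exclusive (one past the term). */
    int end() { return positions[index] + 1; }
}

/** Immutable copy of a cursor's state, taken while it sits on a match. */
final class SpanSnapshot {
    final int doc;
    final int start;
    final int end;

    SpanSnapshot(PositionCursor cursor) {
        this.doc = cursor.doc();
        this.start = cursor.start();
        this.end = cursor.end();
    }
}
```

A snapshot taken before the cursor advances keeps reporting the matching position, whereas the cursor itself reports wherever the search has moved it. The cost is one small object per copy.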

These copies and the matchingSpans() implementations are not very efficient.
I think that this problem can be solved in a better way. But for my
collection and my application it works fine and fast.

Hope that these files will help someone else!

Cheers.


On Fri, Jun 6, 2008 at 6:34 PM, Paul Elschot <pa...@xs4all.nl> wrote:

> See below.
>
> On Friday 06 June 2008 16:23:15, Claudio Corsi wrote:
> > Hi,
> > I'm trying to extend the NearSpansOrdered and NearSpansUnordered
> > classes of the Lucene core so that I can access the
> > inner positions of the current span (in a next() loop). Suppose
> > the current near span starts at position N and ends at position N+k:
> > I would like to discover the starting/ending positions of all the
> > inner clauses that generate that span.
> >
> > I'm working on the NearSpansOrdered class right now. I guess
> > this modification could be trivial, but it takes me time
> > to understand the code. Any hints?
> >
> > For now (as a very inefficient way to proceed) I've added this
> > method, to be called *after each next()*, but it doesn't work as
> > expected:
> >
> > public Spans[] matchingSpans() {
> >
> >       ArrayList<Spans> list = new ArrayList<Spans>();
> >       if (subSpans.length == 0) return null;
> >       for(int pos = 0; pos < subSpans.length; pos++) {
> >           if (subSpans[pos].doc() != matchDoc) continue;
> >           if (subSpans[pos].start() >= matchStart
> >                   && subSpans[pos].end() <= matchEnd)
> >               list.add(subSpans[pos]);
> >       }
> >       return list.toArray(new Spans[0]);
> > }
> >
> > As you can see, I'm just looping over the subSpans array, keeping
> > the ones that have doc() == matchDoc and whose span starts and ends
> > inside the current near span (matchStart and matchEnd are the
> > boundaries returned by start() and end() of NearSpansOrdered). This
> > technique doesn't work. Maybe the problem is that the subSpans are
> > not in the right state after the next() call?
>
> Correct. The reason is that a match must be of minimal length,
> and for that at least the matching subspan at the lowest
> position needs to be advanced beyond its matching position.
> This is the same for both the ordered and unordered case.
>
> So, to implement the matchingSpans() method, it will be necessary
> to copy the subspans when they are at the matching position.
> This will probably involve some fruitless copying for incomplete
> matches that never become a real match.
>
> There is also a difference beyond ordered/unordered.
> In the ordered case, no overlaps between the matching subspans
> are allowed, and in the unordered case overlaps are allowed.
>
> Regards,
> Paul Elschot


-- 
Claudio Corsi

Re: Fwd: SpanNearQuery: how to get the "intra-span" matching positions?

Posted by Paul Elschot <pa...@xs4all.nl>.
See below.

On Friday 06 June 2008 16:23:15, Claudio Corsi wrote:
> Hi,
> I'm trying to extend the NearSpansOrdered and NearSpansUnordered
> classes of the Lucene core so that I can access the
> inner positions of the current span (in a next() loop). Suppose
> the current near span starts at position N and ends at position N+k:
> I would like to discover the starting/ending positions of all the
> inner clauses that generate that span.
>
> I'm working on the NearSpansOrdered class right now. I guess
> this modification could be trivial, but it takes me time
> to understand the code. Any hints?
>
> For now (as a very inefficient way to proceed) I've added this
> method, to be called *after each next()*, but it doesn't work as
> expected:
>
> public Spans[] matchingSpans() {
>
>       ArrayList<Spans> list = new ArrayList<Spans>();
>       if (subSpans.length == 0) return null;
>       for(int pos = 0; pos < subSpans.length; pos++) {
>           if (subSpans[pos].doc() != matchDoc) continue;
>           if (subSpans[pos].start() >= matchStart
>                   && subSpans[pos].end() <= matchEnd)
>               list.add(subSpans[pos]);
>       }
>       return list.toArray(new Spans[0]);
> }
>
> As you can see, I'm just looping over the subSpans array, keeping
> the ones that have doc() == matchDoc and whose span starts and ends
> inside the current near span (matchStart and matchEnd are the
> boundaries returned by start() and end() of NearSpansOrdered). This
> technique doesn't work. Maybe the problem is that the subSpans are
> not in the right state after the next() call?

Correct. The reason is that a match must be of minimal length,
and for that at least the matching subspan at the lowest
position needs to be advanced beyond its matching position.
This is the same for both the ordered and unordered case.

So, to implement the matchingSpans() method, it will be necessary
to copy the subspans when they are at the matching position.
This will probably involve some fruitless copying for incomplete
matches that never become a real match.

There is also a difference beyond ordered/unordered.
In the ordered case, no overlaps between the matching subspans
are allowed, and in the unordered case overlaps are allowed.
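
As an illustration of the overlap rule only, the ordered case can be pictured as requiring each matching subspan to begin at or after the end of the previous one. This is a sketch under assumptions, not Lucene code: SpanRange and OverlapCheck are hypothetical names, and the end position is assumed to be exclusive.

```java
import java.util.List;

/** Hypothetical start/end pair for one matching subspan; end is assumed exclusive. */
final class SpanRange {
    final int start;
    final int end;

    SpanRange(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class OverlapCheck {
    private OverlapCheck() {}

    /** True if each range starts at or after the end of the range before it. */
    static boolean orderedWithoutOverlap(List<SpanRange> ranges) {
        for (int i = 1; i < ranges.size(); i++) {
            if (ranges.get(i).start < ranges.get(i - 1).end) {
                return false; // this range overlaps or precedes the previous one
            }
        }
        return true;
    }
}
```

In the unordered case no such check applies, so a matchingSpans() result there may contain subspans that share positions.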

Regards,
Paul Elschot
