
Posted to solr-user@lucene.apache.org by Justin Lee <le...@gmail.com> on 2016/06/04 21:39:32 UTC

Getting a list of matching terms and offsets

Is anyone aware of a way of getting a list of each matching token and their
offsets after executing a search?  The reason I want to do this is because
I have the physical coordinates of each token in the original document
stored out of band, and I want to be able to highlight in the original
document.  I would really like to have Solr return the list of matching
tokens because then things like stemming and phrase matching will work as
expected. I'm thinking of something like the highlighter component, except
instead of returning html, it would return just the matching tokens and
their offsets.
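
To make the goal concrete, here is a minimal sketch of the client-side half: given a list of matched (start, end) character offsets, look up the physical boxes stored out of band. The `TokenBox` record, its fields, and the sample coordinates are hypothetical, invented for illustration; nothing here is a Solr API.

```python
from bisect import bisect_right
from typing import NamedTuple


class TokenBox(NamedTuple):
    """Hypothetical out-of-band record: character offsets plus a page rectangle."""
    start: int
    end: int
    page: int
    rect: tuple  # (x0, y0, x1, y1); units are whatever the layout store uses


def boxes_for_matches(matches, boxes):
    """Return the boxes whose character range overlaps any (start, end) match.

    `boxes` must be sorted by `start` and must not overlap each other.
    """
    starts = [b.start for b in boxes]
    hits = []
    for m_start, m_end in matches:
        # First candidate: the last box starting at or before the match start.
        i = max(bisect_right(starts, m_start) - 1, 0)
        while i < len(boxes) and boxes[i].start < m_end:
            if boxes[i].end > m_start and boxes[i] not in hits:
                hits.append(boxes[i])
            i += 1
    return hits


# Text "two hours later": "two" 0-3, "hours" 4-9, "later" 10-15.
boxes = [
    TokenBox(0, 3, 1, (72, 700, 90, 712)),
    TokenBox(4, 9, 1, (94, 700, 124, 712)),
    TokenBox(10, 15, 1, (128, 700, 155, 712)),
]
matched = boxes_for_matches([(4, 9)], boxes)
```

The missing piece, and the subject of this thread, is getting the `matches` list out of Solr in the first place.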

I have googled high and low and can't seem to find an exact answer to this
question, so I have spent the last few days examining the internals of the
various highlighting classes in Solr and Lucene.  I think the bulk of the
action is in WeightedSpanTermExtractor and its interaction with
getBestTextFragments in the Highlighter class.  But before I spend any more
time on this I thought I'd ask (1) whether anyone knows of an easier way of
doing this, and (2) whether I'm at least barking up the right tree.

Thanks much,
Justin

Re: Getting a list of matching terms and offsets

Posted by Justin Lee <le...@gmail.com>.

Thank you very much!  That JIRA entry led me to
https://issues.apache.org/jira/browse/SOLR-4722, which still works against
Solr 6 with a couple of modifications and should serve as the basis for
what I want to do.  You saved me a bunch of work, so thanks very much.
 (Also, it is always nice to know that people with more experience than me
took the same approach.)

On Sun, Jun 5, 2016 at 1:09 PM Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi Lee,
>
> Maybe you can find a useful starting point at
> https://issues.apache.org/jira/browse/SOLR-1397
>
> Please consider contributing once you have something working.
>
> Ahmet

Re: Getting a list of matching terms and offsets

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Lee,

Maybe you can find a useful starting point at
https://issues.apache.org/jira/browse/SOLR-1397

Please consider contributing once you have something working.

Ahmet




On Sunday, June 5, 2016 10:37 PM, Justin Lee <le...@gmail.com> wrote:
Thanks, yea, I looked at debug query too.  Unfortunately the output of
debug query doesn't quite do it.  For example, if you use a wildcard query,
it will simply explain the score associated with that wildcard query, not
the actual matching token.  In other words, if you search for "hour*" and
the actual matching text is "hours", debug query doesn't tell you that.
Instead, it just reports the score associated with "hour*".

The closest example I've ever found is this:

https://lucidworks.com/blog/2013/05/09/update-accessing-words-around-a-positional-match-in-lucene-4/

But this kind of approach won't let me use the full power of the Solr
ecosystem.  I'd basically be back to dealing with Lucene directly, which I
think is a step backwards.  I think the right approach is to write my own
SearchComponent, using the highlighter as a starting point.  But I wanted
to make sure there wasn't a simpler way.



Re: Getting a list of matching terms and offsets

Posted by Justin Lee <le...@gmail.com>.

Thanks, yea, I looked at debug query too.  Unfortunately the output of
debug query doesn't quite do it.  For example, if you use a wildcard query,
it will simply explain the score associated with that wildcard query, not
the actual matching token.  In other words, if you search for "hour*" and
the actual matching text is "hours", debug query doesn't tell you that.
Instead, it just reports the score associated with "hour*".
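
As a toy illustration of the output I am after (this is plain Python, not Solr or Lucene, and the whitespace-and-punctuation tokenizer is an assumption standing in for real analysis), listing the tokens that a wildcard actually matched, with their offsets, might look like this:

```python
import re
from fnmatch import fnmatchcase


def matching_tokens(text, pattern):
    """List (token, start, end) for every word in `text` matching a wildcard pattern.

    Toy stand-in for index-time analysis: words are runs of word characters,
    lowercased before comparison.
    """
    return [
        (m.group(), m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if fnmatchcase(m.group().lower(), pattern)
    ]


result = matching_tokens("Open 24 hours; each hour counts.", "hour*")
```

Here `result` is `[('hours', 8, 13), ('hour', 20, 24)]`: the concrete tokens and offsets, rather than just a score for "hour*".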

The closest example I've ever found is this:

https://lucidworks.com/blog/2013/05/09/update-accessing-words-around-a-positional-match-in-lucene-4/

But this kind of approach won't let me use the full power of the Solr
ecosystem.  I'd basically be back to dealing with Lucene directly, which I
think is a step backwards.  I think the right approach is to write my own
SearchComponent, using the highlighter as a starting point.  But I wanted
to make sure there wasn't a simpler way.

On Sun, Jun 5, 2016 at 11:30 AM Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Well, debug query has the list of tokens that caused the match.
> If I am not mistaken, I read an example about span queries and Spans
> that listed the positions of the matches.
> I cannot find the example at the moment.
>
> Ahmet

Re: Getting a list of matching terms and offsets

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Well, debug query has the list of tokens that caused the match.
If I am not mistaken, I read an example about span queries and Spans
that listed the positions of the matches.
I cannot find the example at the moment.

Ahmet

On Sunday, June 5, 2016 9:10 PM, Justin Lee <le...@gmail.com> wrote:
Thanks for the responses Alex and Ahmet.

The TermVector component was the first thing I looked at, but what it gives
you is offset information for every token in the document.  I'm trying to
get a list of tokens that actually match the search query, and unless I'm
missing something, the TermVector component doesn't give you that
information.

The TermSpans class does contain the right information, but again the hard
part is: how do I reliably get a list of TokenSpans for the tokens that
actually match the search query?  That's why I ended up in the highlighter
source code, because the highlighter has to do just this in order to create
snippets with accurate highlighting.

Justin


Re: Getting a list of matching terms and offsets

Posted by Justin Lee <le...@gmail.com>.

Thanks for the responses Alex and Ahmet.

The TermVector component was the first thing I looked at, but what it gives
you is offset information for every token in the document.  I'm trying to
get a list of tokens that actually match the search query, and unless I'm
missing something, the TermVector component doesn't give you that
information.

The TermSpans class does contain the right information, but again the hard
part is: how do I reliably get a list of TokenSpans for the tokens that
actually match the search query?  That's why I ended up in the highlighter
source code, because the highlighter has to do just this in order to create
snippets with accurate highlighting.

Justin

On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi,
>
> Maybe org.apache.lucene.search.spans.TermSpans?
>

Re: Getting a list of matching terms and offsets

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,

Maybe org.apache.lucene.search.spans.TermSpans?



On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
It sounds like TermVector component's output:
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

Perhaps with additional flags enabled (e.g. tv.offsets and/or tv.positions).

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/




Re: Getting a list of matching terms and offsets

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

It sounds like TermVector component's output:
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

Perhaps with additional flags enabled (e.g. tv.offsets and/or tv.positions).
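
A minimal sketch of such a request, built with the Python standard library only. The host, the core name `mycore`, the field name `text`, and the `/tvrh` handler path (taken from Solr's sample configuration) are assumptions to adjust for your setup; the field would also need term vectors, positions, and offsets enabled in the schema.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, core, query, field):
    """Build a URL for a Term Vector Component request with offsets and positions."""
    params = {
        "q": query,
        "fl": "id",
        "tv": "true",            # turn the component on
        "tv.fl": field,          # field(s) to return term vectors for
        "tv.offsets": "true",    # character offsets of each term
        "tv.positions": "true",  # token positions of each term
        "wt": "json",
    }
    # "/tvrh" is assumed to be mapped to the Term Vector Component.
    return f"{base_url}/{core}/tvrh?{urlencode(params)}"


url = term_vector_url("http://localhost:8983/solr", "mycore", "text:hours", "text")
```

Note that this returns the vectors for every term in each matching document, not only the terms that matched the query.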

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 5 June 2016 at 07:39, Justin Lee <le...@gmail.com> wrote:
> Is anyone aware of a way of getting a list of each matching token and their
> offsets after executing a search?  The reason I want to do this is because
> I have the physical coordinates of each token in the original document
> stored out of band, and I want to be able to highlight in the original
> document.  I would really like to have Solr return the list of matching
> tokens because then things like stemming and phrase matching will work as
> expected. I'm thinking of something like the highlighter component, except
> instead of returning html, it would return just the matching tokens and
> their offsets.
>
> I have googled high and low and can't seem to find an exact answer to this
> question, so I have spent the last few days examining the internals of the
> various highlighting classes in Solr and Lucene.  I think the bulk of the
> action is in WeightedSpanTermExtractor and its interaction with
> getBestTextFragments in the Highlighter class.  But before I spend any more
> time on this I thought I'd ask (1) whether anyone knows of an easier way of
> doing this, and (2) whether I'm at least barking up the right tree.
>
> Thanks much,
> Justin