Posted to solr-user@lucene.apache.org by Peter Wolanin <pe...@acquia.com> on 2011/12/05 18:42:10 UTC

Retrieving matched tokens and their payload?

A colleague came to me with a problem that intrigued me.  I can see
partly how to solve it with Solr, but I'm looking for insight into solving
the last step.

The problem:

1) Start from a set of text transcriptions of videos where there is a
timestamp associated with each word.

2) Index into Solr with analysis including stemming, so that a user
can search for videos based on keywords.

3) When the user clicks into a single video in the search result,
retrieve from the corresponding doc in Solr the timestamps of all
words matching the keyword(s) (including stemming).

So, obviously #1 and #2 are easy.  As part of #2 it would seem one
could use the DelimitedPayloadTokenFilterFactory to index the
timestamp as a payload for each word.  I don't want the payload to
influence score, but my understanding is that by default it will not.
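
As a rough illustration of the indexing side, the sketch below turns
(word, timestamp) pairs into the "word|payload" text that a delimited
payload filter splits apart. It assumes the default "|" delimiter, a
whitespace tokenizer ahead of the filter, and timestamps in seconds; the
function name and the two-decimal format are hypothetical choices, and the
encoder configured on the filter would have to match whatever format is
chosen.

```python
def to_payload_text(words, delimiter="|"):
    """Join (word, seconds) pairs into space-separated 'word|seconds' tokens.

    Hypothetical helper: the delimiter and the two-decimal timestamp
    format are assumptions, not something prescribed by Solr.
    """
    tokens = []
    for word, seconds in words:
        # A word containing the delimiter or whitespace would be split
        # incorrectly at analysis time, so reject it up front.
        if delimiter in word or any(ch.isspace() for ch in word):
            raise ValueError(f"cannot encode word: {word!r}")
        tokens.append(f"{word}{delimiter}{seconds:.2f}")
    return " ".join(tokens)


# Example transcript fragment (made-up data for illustration only).
transcript = [("searching", 12.5), ("videos", 13.0)]
field_value = to_payload_text(transcript)
print(field_value)
```

The resulting string would be sent as the value of the payload-enabled
field when the document is indexed.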

Ok, so now for the harder part.  For #3 it would seem I need something
roughly like the highlighter - to return each matching word and the
payload which is the timestamp.

I'm not seeing any existing request handler or component that would do
this.  Is there an easy way to retrieve the indexed words (or analyzed
tokens) and their payload?

Thanks,

-Peter


--
Peter M. Wolanin, Ph.D.      : Momentum Specialist,  Acquia, Inc.
peter.wolanin@acquia.com : 781-313-8322

"Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"

Re: Retrieving matched tokens and their payload?

Posted by Chris Hostetter <ho...@fucit.org>.
: 3) When the user clicks into a single video in the search result,
: retrieve from the corresponding doc in Solr the timestamps of all
: words matching the keyword(s) (including stemming).
	...
: Ok, so now for the harder part.  For #3 it would seem I need something
: roughly like the highlighter - to return each matching word and the
: payload which is the timestamp.
: 
: I'm not seeing any existing request handler or component that would do
: this.  Is there an easy way to retrieve the indexed words (or analyzed
: tokens) and their payload?

I suspect the easiest way to go about this would be a request handler that 
uses SpanQuery to find all the matching Spans on the document, and then, 
while iterating over the spans, calls getPayload().
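
To make the shape of that loop concrete, here is a toy model in plain
Python. It does not use Lucene at all: the list of (word, timestamp) pairs
stands in for one document's token positions and payloads, and crude_stem
is a hypothetical stand-in for the real analysis chain. In an actual
handler the matches would come from the spans and the timestamp from each
span's payload.

```python
def crude_stem(word):
    """Hypothetical stemmer stand-in: lowercase and strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s", "ed"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matching_timestamps(tokens, query):
    """Return (word, timestamp) for every token whose stem matches the query.

    tokens: list of (word, timestamp) pairs in document order, modelling
            the token positions and payloads of a single document.
    query:  whitespace-separated keywords.
    """
    wanted = {crude_stem(keyword) for keyword in query.split()}
    return [(word, ts) for word, ts in tokens if crude_stem(word) in wanted]


# Made-up transcript data for illustration only.
doc = [("Searching", 12.5), ("videos", 13.0), ("is", 13.4),
       ("fun", 13.6), ("searches", 40.2)]
hits = matching_timestamps(doc, "search video")
print(hits)
```

The handler's response for a single video would then be a list like hits,
one entry per matching word occurrence.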

Of course, this assumes your queries can all be represented as SpanQueries 
(so no numerics or function queries or anything too crazy ... but if the 
goal is searching words in video transcripts, that shouldn't be a show 
stopper).

-Hoss