Posted to solr-user@lucene.apache.org by Jan Høydahl <ja...@cominvent.com> on 2011/10/10 16:19:48 UTC

In-document highlighting DocValues?

Hi,

We index structured documents, with numbered chapters, paragraphs and sentences. After doing a (rather complex) search, we may get multiple matches in each result doc. We want to highlight those matches in our front-end, and currently we do a simple string match of the query words against the raw text.

However, this highlights some words that do not satisfy the original query, and fails to highlight other words where the match was on a stem, synonym or wildcard. We thus need to improve this, and my plan was to utilize DocValues (Payloads). Would the following work?

1. For each term in the field "text", index DocValues with info about chapter#, paragraph#, sentence# and word#.
This can be done in our application code, e.g. "foo|1,2,3,4" for chapter 1, paragraph 2, sentence 3 and word 4.

2. Then, for a specific document in the result list, retrieve a list of all matches in field "text", and for each match,
retrieve the associated DocValues.

3. The client application can now use this information to highlight matches, as well as "jump to next match" etc,
and would highlight the correct words only, e.g. it would be able to highlight "colour" even if the match was on the
synonym "color".
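
As a rough illustration of steps 1 to 3 (a sketch only, not Solr or Lucene code), the Python below parses tokens in the "foo|1,2,3,4" form and lets a client step through match locations in reading order. The names Location, parse_token and next_match are hypothetical and exist only for this example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Location(NamedTuple):
    """Position of a word inside a structured document (hypothetical type)."""
    chapter: int
    paragraph: int
    sentence: int
    word: int


def parse_token(token: str) -> Tuple[str, Location]:
    """Split a token such as 'foo|1,2,3,4' into the term and its location."""
    term, _, payload = token.rpartition("|")
    chapter, paragraph, sentence, word = (int(n) for n in payload.split(","))
    return term, Location(chapter, paragraph, sentence, word)


def next_match(matches: List[Location], current: Optional[Location]) -> Optional[Location]:
    """Return the first match after `current` in reading order, or None."""
    for loc in sorted(matches):
        if current is None or loc > current:
            return loc
    return None


term, loc = parse_token("foo|1,2,3,4")
# Assumed list of match locations for one document, in arbitrary order.
matches = [Location(2, 1, 1, 5), loc, Location(1, 2, 4, 1)]
first = next_match(matches, None)
second = next_match(matches, first)
```

Because the locations compare as tuples, sorting them gives reading order for free, which is what a "jump to next match" button would need.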

Another use case for this technique would be OCR applications where we store with each term its x,y offsets for where it occurs in
the original TIFF image scan.

What is already in place and what code needs to be written? I don't currently see how to get a complete list of matches for a particular document.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

Re: In-document highlighting DocValues?

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Looking more at the new DocValues for 4.0, they are only per-document, right?

So I guess what I'm thinking is to use the good old Payloads per term to store this info. Since a payload is a single value, we could encode the four numbers as a byte[] somehow.
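
One possible encoding, sketched in Python rather than in Lucene code: pack the four numbers into a fixed-width byte string. The layout (four unsigned 16-bit big-endian integers) and the function names are assumptions made for this example, not something the thread settles on.

```python
import struct

# Assumed layout: four unsigned 16-bit big-endian integers, 8 bytes in total.
# Each value must therefore fit in the range 0..65535.
_LAYOUT = struct.Struct(">4H")


def encode_location(chapter: int, paragraph: int, sentence: int, word: int) -> bytes:
    """Pack chapter, paragraph, sentence and word numbers into one payload."""
    return _LAYOUT.pack(chapter, paragraph, sentence, word)


def decode_location(payload: bytes):
    """Unpack a payload back into (chapter, paragraph, sentence, word)."""
    return _LAYOUT.unpack(payload)


payload = encode_location(1, 2, 3, 4)
decoded = decode_location(payload)
```

A fixed-width layout keeps decoding trivial; a variable-length integer encoding would save space at the cost of slightly more code.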

But the crucial point here is how to iterate through every single matching term in a field and pull out the payloads?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

Re: In-document highlighting DocValues?

Posted by Michael Sokolov <so...@ifactory.com>.

On 10/14/2011 7:20 PM, Jan Høydahl wrote:
> Hi,
>
> The Highlighter is way too slow for this customer's particular use case - which is very large documents. We don't need highlighted snippets for now, but we need to accurately decide what words (offsets) in the real HTML display of the resulting page to highlight. For this we only need offset info, not the snippets/fragments from the stored field.
>
> But I have not looked at the Highlighter code. Perhaps we could fork it into a new search component which pulls out only the necessary meta info and payloads for us and returns it to the client?
>
Jan, I've looked into this, and I believe the slowness of Highlighter
doesn't have to do with constructing the snippets as much as with the
analysis that is required to find the locations of matching terms in the
document text, so I think your problem is basically the same as
highlighting.

There seem to be basically two approaches right now. One is Highlighter,
which, as you point out, is a bit slow because it has to basically
re-analyze the entire document, but this does have the virtue of an
exact match to the semantics of the original query.  The other is
FastVectorHighlighter, which works by doing some cheap mimicry of the
original query, extracting terms from the query (and also intersecting
with the document too, if you have MultiTermQuery), and finding the
offsets of those terms (which have to be stored in the index).  It is
smart enough to respect phrase boundaries, but does not support every
kind of Query; however it might be good enough, and is quite a bit
faster than Highlighter (5-10x I think?).
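
To make the offset-based idea concrete, here is a toy Python sketch of looking up stored offsets for the terms of a query. It is not the FastVectorHighlighter code and it ignores phrases entirely; the stored_offsets dict merely stands in for offsets kept in the index, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Offsets = List[Tuple[int, int]]  # (start, end) character offsets, end exclusive


def highlight_offsets(stored_offsets: Dict[str, Offsets], query_terms: Iterable[str]) -> Offsets:
    """Collect the offsets of every query term found in the document, in text order."""
    spans: Offsets = []
    for term in set(query_terms):
        spans.extend(stored_offsets.get(term, []))
    return sorted(spans)


text = "The colour red and the color blue"
# Stand-in for per-document offsets stored at index time, keyed by analyzed term.
# A synonym filter is assumed to have mapped both spellings to "color".
stored_offsets = {
    "the": [(0, 3), (19, 22)],
    "color": [(4, 10), (23, 28)],
    "red": [(11, 14)],
    "blue": [(29, 33)],
}
spans = highlight_offsets(stored_offsets, ["color", "blue"])
words = [text[start:end] for start, end in spans]
```

Since the offsets point back into the original text, the client highlights "colour" even though the indexed term was "color", and no re-analysis of the document is needed at query time.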

The work in LUCENE-2878 is the only thing I know of that could represent
an improvement.  I did some tests there including storing character
offsets as payloads and got some additional speedup (maybe another 2x?)
beyond FVH.  There doesn't seem to be a lot of energy going into pushing
that ahead right now though, and it requires some fundamental changes to
the way that searching is done.

-Mike

Re: In-document highlighting DocValues?

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

The Highlighter is way too slow for this customer's particular use case - which is very large documents. We don't need highlighted snippets for now, but we need to accurately decide what words (offsets) in the real HTML display of the resulting page to highlight. For this we only need offset info, not the snippets/fragments from the stored field.

But I have not looked at the Highlighter code. Perhaps we could fork it into a new search component which pulls out only the necessary meta info and payloads for us and returns it to the client?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. okt. 2011, at 16:23, Mike Sokolov wrote:

> Is there some reason you don't want to leverage Highlighter to do this work?  It has all the necessary code for using the analyzed version of your query so it will only match tokens that really contribute to the search match.
> 
> You might also be interested in LUCENE-2878 (which is still under development on a branch though).  It aims to provide first-class access to payloads and positions during scoring, and this will be very useful for complex highlighting tasks.
> 
> Another possible solution to the OCR problem could be: generate an XML file with a tag for each word encoding its x,y coords, like <word x="3" y="10">This</word>; index that file using XmlCharFilter or HTMLStripCharFilter. Then when you search, use the Solr highlighter to highlight the entire document, and process it using XML tools to find the locations of the matches.
> 
> -Mike

Re: In-document highlighting DocValues?

Posted by Mike Sokolov <so...@ifactory.com>.

Is there some reason you don't want to leverage Highlighter to do this 
work?  It has all the necessary code for using the analyzed version of 
your query so it will only match tokens that really contribute to the 
search match.

You might also be interested in LUCENE-2878 (which is still under 
development on a branch though).  It aims to provide first-class access 
to payloads and positions during scoring, and this will be very useful 
for complex highlighting tasks.

Another possible solution to the OCR problem could be: generate an XML
file with a tag for each word encoding its x,y coords, like
<word x="3" y="10">This</word>; index that file using XmlCharFilter or
HTMLStripCharFilter. Then when you search, use the Solr highlighter to
highlight the entire document, and process it using XML tools to find
the locations of the matches.
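
The post-processing step could look roughly like the Python sketch below. It assumes the highlighter wraps each matching word's text in an <em> element inside the <word> tag and that the words sit under a <page> root; both are assumptions made for this example, as is the match_locations helper.

```python
import xml.etree.ElementTree as ET

# Assumed highlighter output: each matching word's text is wrapped in <em>.
highlighted = (
    '<page>'
    '<word x="3" y="10">This</word>'
    '<word x="9" y="10"><em>colour</em></word>'
    '<word x="17" y="10">scan</word>'
    '</page>'
)


def match_locations(xml_text: str):
    """Return (x, y, text) for every <word> that contains an <em> element."""
    root = ET.fromstring(xml_text)
    hits = []
    for word in root.iter("word"):
        em = word.find("em")
        if em is not None:
            hits.append((int(word.get("x")), int(word.get("y")), em.text))
    return hits


hits = match_locations(highlighted)
```

The resulting coordinates can then be drawn as overlays on the scanned image.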

-Mike
