You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Renaud Waldura <re...@library.ucsf.edu> on 2006/11/09 19:55:41 UTC

PDF Highlighting Again

Greetings:
 
I read the mailing-list archives about this topic and found the PDFBox
solutions at: http://www.pdfbox.org/userguide/highlighting.html

Basically there are 3 options:
1- append query parameters to the PDF URL
2- generate a highlight XML document that Acrobat Reader will download
separately
3- alter contents of the PDF

I've only tried (1) so far and I really like the user experience: Acrobat
Reader displays a list of matches in a small pane on the right. By clicking
in this pane the user can bounce around the document. I don't think (2) or
(3) can match that.

I implemented (1) by rewriting the Lucene query and creating a query string
with the terms. I am now facing a problem where the number of terms expand
beyond the limits of the query string.

I realized it's silly to ask Acrobat to highlight all the query terms when
in reality only a subset of them is present in the document. But how can I
compute that subset?

I'm thinking I might have to tokenize the document text (I have it), then
compute the intersection between the set of all terms and the set of terms
from the rewritten query. Blech. Sounds expensive. Any other ideas?

How does the contrib'ed highlighter work? I'm guessing it's doing something
similar. Any way I could not tokenize the document twice (they can be really
long)? 

--Renaud




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: PDF Highlighting Again

Posted by Daniel Naber <lu...@danielnaber.de>.

On Thursday 09 November 2006 19:55, Renaud Waldura wrote:

> I'm thinking I might have to tokenize the document text (I have it),
> then compute the intersection between the set of all terms and the set
> of terms from the rewritten query. Blech. Sounds expensive. Any other
> ideas?

No faster, but easier: index the Document in a RAMDirectory, open that with 
an IndexReader und use this reader to do the first (and only) rewrite. As 
this only happens when the user really clicks a match I see no problems 
with performance.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org