Posted to dev@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/05 03:27:55 UTC

in document highlighting

Another compelling reason for better PDF parsing is that it should make
in-document highlighting possible sometime in the future.
 

Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org/
Free Open Source Tax Software

 

RE: in document highlighting

Posted by Richard Braman <rb...@bramantax.com>.
FYI

-----Original Message-----
From: itext-questions-admin@lists.sourceforge.net
[mailto:itext-questions-admin@lists.sourceforge.net] On Behalf Of
chris@ovitas.no
Sent: Thursday, March 09, 2006 6:01 AM
To: itext-questions@lists.sourceforge.net
Subject: [iText-questions] Extracting text location for highlighting in
reader


I'm looking into how you can ask the acrobat reader web plugin to
highlight words so that we can get hit-highlighting of web search
working for an application.

I've read thru this document: 

http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf

It seems that you pass an XML document to the reader defining where you
want highlighting.

However I then need to know the offset on the page of where I want to
highlight (offset is a count either in characters or words).

So - is iText a good way to extract just the text of a page so that we
can use it to calculate the offsets?

-- 
Chris

At 06:01 AM 3/9/2006, chris@ovitas.no wrote:
>http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
>
>It seems that you pass an XML document to the reader defining where you
>want highlighting.

         Correct.


>However I then need to know the offset on the page of where I want to 
>highlight (offset is a count either in characters or words).

         Correct.


>So - is iText a good way to extract just the text of a page so that we 
>can use it to calculate the offsets?

         No.

         Look at PdfBox or Multivalent.


Leonard



On Thu, Mar 09, 2006 at 06:50:09AM -0500, Leonard Rosenthol wrote:
> 
> >So - is iText a good way to extract just the text of a page so that 
> >we can use it to calculate the offsets?
> 
>         No.
> 
>         Look at PdfBox or Multivalent.

Thanks for the pointer. Seems like the char offset method isn't too
reliable (something that's 150 chars into the text file from PDFBox is
200 chars in according to the highlighter in reader).

But - with word based offset (and a lot of guesswork as to what acrobat
reader thinks is a word boundary) then this looks like it might actually
fly :)

-- 
Chris Searle
chris@ovitas.no
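
As a rough illustration of the word-based offset idea discussed above,
here is a minimal Python sketch. It assumes the page text has already
been extracted (for example with PDFBox) into a plain string. The
function names word_spans and find_hits are hypothetical, and the word
boundary rule used here (a run of letters, digits, or underscores) is
only a guess; Acrobat Reader's own notion of a word may differ, so the
indices would need checking against what the reader actually
highlights.

```python
import re

# Assumption: a "word" is a maximal run of letters, digits, or underscores.
# Acrobat Reader's real word-boundary rules are not known here and may differ.
WORD_RE = re.compile(r"\w+")


def word_spans(page_text):
    """Return (word_index, char_offset, word) for each word in page_text."""
    return [
        (index, match.start(), match.group())
        for index, match in enumerate(WORD_RE.finditer(page_text))
    ]


def find_hits(page_text, term):
    """Return (word_index, char_offset) pairs for words equal to term, ignoring case."""
    wanted = term.lower()
    return [
        (index, offset)
        for index, offset, word in word_spans(page_text)
        if word.lower() == wanted
    ]
```

For the text "Tax code, and more tax." and the term "tax", find_hits
returns word indices 0 and 4 together with character offsets 0 and 19.
Comparing both columns against the reader's behaviour on a few sample
pages should show which unit lines up more reliably.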