Posted to users@pdfbox.apache.org by Dave Trombley <da...@gmail.com> on 2023/05/02 13:29:04 UTC

Extracting text at a lower level than PDFTextStripper provides?

Hi all!

I'm trying to extract some specially formatted text from a PDF, and it
seems like it will be impossible to use PDFTextStripper for this task.  In
particular, some of the font style (bold / italic / etc.) and color
information is semantically relevant, and what is considered a "paragraph"
depends on this information.

What would be ideal is if there were a way to have a callback of mine
called for each glyph on the page, containing its font, color, size, glyph,
and location in translated / simple page coordinates.  Is there a way to do
something like that?

I've looked at some of the classes that PDFTextStripper derives from, but
it's not clear to me how these work and they seem to have TOO much
information, not at all a simple view of the characters / text
themselves.

Can anyone provide a suggestion?

Thanks,
  David

Re: Extracting text at a lower level than PDFTextStripper provides?

Posted by "Brian L. Matthews" <bl...@gmail.com>.
On 5/2/23 6:29 AM, Dave Trombley wrote:
> I'm trying to extract some specially formatted text from a PDF, and it
> seems like it will be impossible to use PDFTextStripper for this task.  In
> particular, some of the font style (bold / italic / etc.) and color
> information is semantically relevant, and what is considered a "paragraph"
> depends on this information.
>
> What would be ideal is if there were a way to have a callback of mine
> called for each glyph on the page, containing its font, color, size, glyph,
> and location in translated / simple page coordinates.  Is there a way to do
> something like that?
>
> I've looked at some of the classes that PDFTextStripper derives from, but
> it's not clear to me how these work and they seem to have TOO much
> information, not at all a simple view of the characters / text
> themselves.

I use something like this fairly often:

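     import java.util.ArrayList;
     import java.util.List;
     import org.apache.pdfbox.contentstream.operator.Operator;
     import org.apache.pdfbox.cos.COSBase;
     import org.apache.pdfbox.pdfparser.PDFStreamParser;

     // page is assumed to be a PDPage whose content stream you want to walk
     PDFStreamParser     parser = new PDFStreamParser(page);
     List<COSBase>       operands = new ArrayList<>();
     Object              token;

     while ((token = parser.parseNextToken()) != null)
     {
         // Operands come first in a content stream; collect them until
         // the operator that consumes them shows up.
         if (token instanceof COSBase)
         {
             operands.add((COSBase) token);

             continue;
         }

         if (!(token instanceof Operator))
             throw new IllegalArgumentException("Unknown token " + token);

         // parseOperator() is a method you write yourself; it receives the
         // operator together with the operands collected for it.
         parseOperator((Operator) token, operands);

         operands.clear();
     }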

That's very low level: you have to track positions, transformation
matrices, graphics states, and everything else yourself, but you get to
see everything the PDF says to do, and can set state and react accordingly.
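
As a rough sketch of what that state tracking might look like, here is a
stripped-down handler that only follows the current font. It uses plain
strings as stand-ins for PDFBox's COSBase and Operator tokens so that it
runs on its own; the class and method names (TextRunTracker, handle, runs)
are made up for illustration and are not part of PDFBox. It assumes only
four operators matter: Tf (set font and size), Tj (show text), and q / Q
(save / restore the graphics state, which includes the font).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in: strings replace PDFBox's COSBase/Operator tokens.
final class TextRunTracker {
    /** One piece of shown text plus the font state in effect when it was drawn. */
    static final class Run {
        final String font;
        final double size;
        final String text;

        Run(String font, double size, String text) {
            this.font = font;
            this.size = size;
            this.text = text;
        }
    }

    /** The part of the graphics state this sketch cares about. */
    private static final class State {
        String font;
        double size;

        State copy() {
            State s = new State();
            s.font = font;
            s.size = size;
            return s;
        }
    }

    private State current = new State();
    private final Deque<State> saved = new ArrayDeque<>();
    private final List<Run> runs = new ArrayList<>();

    /** Called once per operator with the operands collected before it. */
    void handle(String operator, List<String> operands) {
        switch (operator) {
            case "q": // save graphics state
                saved.push(current.copy());
                break;
            case "Q": // restore graphics state (ignored if nothing was saved)
                if (!saved.isEmpty()) {
                    current = saved.pop();
                }
                break;
            case "Tf": // set font: resource name, then size
                current.font = operands.get(0);
                current.size = Double.parseDouble(operands.get(1));
                break;
            case "Tj": // show text with whatever font is current
                runs.add(new Run(current.font, current.size, operands.get(0)));
                break;
            default:
                break; // every other operator is ignored in this sketch
        }
    }

    List<Run> runs() {
        return runs;
    }
}
```

A real version would also need the text and transformation matrices, the
TJ / ' / " text-showing operators, and the color operators, which is where
most of the work is.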

Also, I don't think PDF has specific font style operators the way, say,
HTML has style tags. So you won't see something like:

turn on italic
draw some text
turn off italic

Instead it will be:

choose a font that happens to be italic
draw some text
reset the current font
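
Since the style is only implied by which font gets chosen, one way to
recover it is to guess from the font's name. The sketch below is a
heuristic under that assumption, not anything PDFBox provides: the class
and method names (FontStyleGuess, baseName, looksItalic, looksBold) are
invented for illustration, and it takes the name as a plain string. It
strips the six-letter subset tag that embedded subset fonts carry (as in
"ABCDEF+Times-Italic") before looking for style words. If I remember
right, the font descriptor's flags (PDFontDescriptor.isItalic() and
isForceBold() in PDFBox) are another signal, but producers don't always
set them, so treat both as hints.

```java
import java.util.Locale;

// Hypothetical helper: guesses style from a PDF font name such as
// "ABCDEF+Times-BoldItalic". A heuristic only; names are not standardized.
final class FontStyleGuess {
    private FontStyleGuess() {
    }

    /** Drops a leading subset tag (six uppercase letters and '+') if present. */
    static String baseName(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            boolean isTag = true;
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    isTag = false;
                }
            }
            if (isTag) {
                return fontName.substring(7);
            }
        }
        return fontName;
    }

    /** True if the name (minus any subset tag) mentions italic or oblique. */
    static boolean looksItalic(String fontName) {
        String n = baseName(fontName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** True if the name (minus any subset tag) mentions bold. */
    static boolean looksBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

Stripping the tag first matters because the tag is arbitrary letters: a
font named "ITALIC+Helvetica" is still plain Helvetica.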

Of course I'm nowhere near an expert at PDF or PDFBox, so maybe someone 
else will have a better suggestion.

Brian
