Posted to users@pdfbox.apache.org by Warren Gallagher <wa...@apxconsult.com> on 2015/03/20 14:43:16 UTC

Looking for some guidance on using PDFBox to analyze page content

 

Greetings, 

Is there a means to determine if a page contains: 

 	* vector graphics
 	* raster graphics (and what format)

Regards, 

WARREN GALLAGHER - CTO

warren.gallagher@apxconsult.com

M: 613-791-4987 W: 613-262-2601 Advance Property eXposure Canada Inc.
1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9 APXConsult.com
[1] 

Links:
------
[1] http://apxconsult.com

Re: Looking for some guidance on using PDFBox to analyze page content

Posted by Eliot Kimber <ek...@rsicms.com>.

You can definitely analyze all the raster images in a PDF and get their
format (as stored in the PDF data stream).
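
As a rough illustration of "format as stored": in the PDF specification, an image stream's /Filter entry names the compression applied to it, and that usually identifies the original raster format. The sketch below maps the standard filter names to a description. It takes the filter names as plain strings and assumes nothing about how PDFBox exposes them; the class and method names (ImageFilterFormats, describe) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ImageFilterFormats {
    // Standard PDF stream filter names and the raster encoding each implies.
    private static final Map<String, String> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put("DCTDecode", "JPEG");
        FORMATS.put("JPXDecode", "JPEG 2000");
        FORMATS.put("CCITTFaxDecode", "CCITT Group 3/4 fax");
        FORMATS.put("JBIG2Decode", "JBIG2");
        FORMATS.put("FlateDecode", "raw samples, zlib/deflate compressed");
        FORMATS.put("LZWDecode", "raw samples, LZW compressed");
        FORMATS.put("RunLengthDecode", "raw samples, run-length compressed");
    }

    /**
     * Describe an image from its filter chain. Filters are listed in decoding
     * order, so the image-specific filter is normally the last one; transport
     * filters such as ASCII85Decode are skipped because they are not in the map.
     */
    public static String describe(List<String> filters) {
        for (int i = filters.size() - 1; i >= 0; i--) {
            String format = FORMATS.get(filters.get(i));
            if (format != null) {
                return format;
            }
        }
        return "raw samples (no recognized compression filter)";
    }

    public static void main(String[] args) {
        System.out.println(describe(Arrays.asList("DCTDecode")));
        System.out.println(describe(Arrays.asList("ASCII85Decode", "FlateDecode")));
    }
}
```

Note that Flate, LZW and run-length images have no standalone file format; they are sample arrays that a tool would typically re-encode (for example as PNG) on export.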

Vector may be harder, since PDF is fundamentally a drawing language and it
may not be possible to reliably distinguish drawing commands that are just
decorating a page or producing a table from drawing commands that came from
an SVG or Illustrator drawing. But my guess would be that for a
reasonably consistent set of PDFs (e.g., all produced using the same
authoring tool or batch formatter) there should be reliable patterns
you can key off of.

Cheers,

E.
-- 
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Looking for some guidance on using PDFBox to analyze page content

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

We do a great deal of this and have created two downstream packages which
consume the output of PDFBox:

* https://bitbucket.org/petermr/pdf2svg/ (which translates the PDF into SVG)
* https://bitbucket.org/petermr/svg2xml (which tries to convert the SVG
into high-level constructs)

There are roughly 3 outputs from PDFBox that relate to the viewable page
(we deliberately ignore all metadata, dictionaries, etc., as they are likely
to be inconsistent):
* characters, either through codepoints (often not Unicode, unfortunately)
or through pixel-based glyphs
* bitmaps (raster), as Eliot mentions
* graphics paths (move, line, quadratic and cubic Bezier).

It is possible for all of these to occur in the same area. However, in many
instances the "text" and the "graphics" are separated by whitespace. (We
cannot rely on the order of primitives.) We can then use whitespace
heuristics to separate the page into "text", "graphics" and "pixel images".
(Note, however, that text could contain small pixel images for characters,
and also small paths for underlines, etc.)
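
To make the whitespace idea concrete, here is a minimal sketch of one possible heuristic, not the method used in pdf2svg or svg2xml. It assumes each primitive has already been reduced to a bounding box, and it merges boxes whose separation is at most a chosen gap; what remains are regions separated by whitespace. The names (WhitespaceClusters, Box, cluster) and the gap parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceClusters {
    /** Axis-aligned bounding box of one primitive (glyph, image or path). */
    public static final class Box {
        public final double minX, minY, maxX, maxY;

        public Box(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }

        /** True if the boxes overlap or lie within 'gap' of each other on both axes. */
        boolean near(Box o, double gap) {
            return minX - gap <= o.maxX && o.minX - gap <= maxX
                && minY - gap <= o.maxY && o.minY - gap <= maxY;
        }

        Box union(Box o) {
            return new Box(Math.min(minX, o.minX), Math.min(minY, o.minY),
                           Math.max(maxX, o.maxX), Math.max(maxY, o.maxY));
        }
    }

    /** Repeatedly merge any two nearby boxes until every region is isolated. */
    public static List<Box> cluster(List<Box> boxes, double gap) {
        List<Box> regions = new ArrayList<>(boxes);
        boolean merged = true;
        while (merged) {
            merged = false;
            // Restart after each merge: a grown region may now touch others.
            for (int i = 0; i < regions.size() && !merged; i++) {
                for (int j = i + 1; j < regions.size() && !merged; j++) {
                    if (regions.get(i).near(regions.get(j), gap)) {
                        regions.set(i, regions.get(i).union(regions.get(j)));
                        regions.remove(j);
                        merged = true;
                    }
                }
            }
        }
        return regions;
    }
}
```

This is cubic in the worst case, which is acceptable for a sketch. The order of the input boxes does not need to be meaningful, in line with the point above that the order of primitives cannot be relied on.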

Assuming that you have "clean" graphics - such as plots - it is possible
with a great deal of work to extract a reasonable guess at the original
primitives. (For example there is no "circle" or "rectangle" in PDF, only
paths).
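
As a small illustration of guessing a primitive back from a path, the sketch below checks whether the vertices of a closed subpath (for example one built from a move, three lines and a close) form an axis-aligned rectangle. It is only a sketch under stated assumptions: the vertices are already in page coordinates, curves are not handled, and the name RectangleGuess and the tolerance eps are hypothetical.

```java
public class RectangleGuess {
    /**
     * pts holds the subpath vertices as {x, y} pairs. The closing segment is
     * implicit; a repeated first point at the end is tolerated.
     */
    public static boolean isAxisAlignedRectangle(double[][] pts, double eps) {
        int n = pts.length;
        if (n == 5
                && Math.abs(pts[0][0] - pts[4][0]) <= eps
                && Math.abs(pts[0][1] - pts[4][1]) <= eps) {
            n = 4; // explicit closing point duplicates the start
        }
        if (n != 4) {
            return false;
        }
        boolean prevHorizontal = false;
        for (int i = 0; i < 4; i++) {
            double dx = Math.abs(pts[(i + 1) % 4][0] - pts[i][0]);
            double dy = Math.abs(pts[(i + 1) % 4][1] - pts[i][1]);
            boolean horizontal = dy <= eps && dx > eps;
            boolean vertical = dx <= eps && dy > eps;
            // Every edge must be a non-degenerate horizontal or vertical segment...
            if (!horizontal && !vertical) {
                return false;
            }
            // ...and edges must alternate between the two directions.
            if (i > 0 && horizontal == prevHorizontal) {
                return false;
            }
            prevHorizontal = horizontal;
        }
        return true;
    }
}
```

Recognizing circles is harder, since they normally arrive as several cubic Bezier segments that only approximate an arc.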

It depends on what your material is, how it was produced, what the
primitives are, etc. You are very welcome to try our software which is all
Apache2 licensed.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Looking for some guidance on using PDFBox to analyze page content

Posted by Tilman Hausherr <TH...@t-online.de>.

Yes, by analysing the content stream operators (e.g. "l", "c"), but you
will have the problem that, for example, underlined text consists of drawn
font glyphs (which technically are also vector graphics) plus a line. And
you won't be able to tell easily that this line is related to the text.
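
A minimal sketch of that operator analysis, working on an already decoded content stream held in a string rather than through the PDFBox API. It reports which kinds of drawing operators occur: path-painting operators (S, s, f, F, f*, B, B*, b, b*), text-showing operators (Tj, TJ, ', "), inline images (BI) and XObject invocations (Do). The class name ContentStreamScan and its Kind enum are hypothetical. The tokenizer is simplified, and inline image data is skipped by searching for a whitespace-delimited EI, which can misfire on binary data.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

public class ContentStreamScan {
    public enum Kind { PAINTED_PATH, TEXT, INLINE_IMAGE, XOBJECT }

    private static final Set<String> PAINT = new HashSet<>(
        Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*"));
    private static final Set<String> SHOW_TEXT = new HashSet<>(
        Arrays.asList("Tj", "TJ", "'", "\""));
    private static final String DELIMS = "()<>[]/%";

    public static Set<Kind> scan(String s) {
        Set<Kind> found = EnumSet.noneOf(Kind.class);
        int i = 0, n = s.length();
        while (i < n) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '%') {                 // comment runs to end of line
                while (i < n && s.charAt(i) != '\n' && s.charAt(i) != '\r') i++;
            } else if (c == '(') {                 // literal string operand
                i = skipLiteralString(s, i);
            } else if (s.startsWith("<<", i) || s.startsWith(">>", i)) {
                i += 2;                            // dictionary delimiters
            } else if (c == '<') {                 // hex string operand
                int end = s.indexOf('>', i);
                i = end < 0 ? n : end + 1;
            } else if (c == '/') {                 // name operand, e.g. /Im1
                i = endOfToken(s, i + 1);
            } else if (DELIMS.indexOf(c) >= 0) {   // stray delimiter, e.g. [ or ]
                i++;
            } else {                               // number, keyword or operator
                int end = endOfToken(s, i);
                String token = s.substring(i, end);
                i = end;
                if (PAINT.contains(token)) found.add(Kind.PAINTED_PATH);
                else if (SHOW_TEXT.contains(token)) found.add(Kind.TEXT);
                else if (token.equals("Do")) found.add(Kind.XOBJECT);
                else if (token.equals("BI")) {
                    found.add(Kind.INLINE_IMAGE);
                    i = skipInlineImage(s, i);
                }
            }
        }
        return found;
    }

    private static int endOfToken(String s, int from) {
        while (from < s.length() && !Character.isWhitespace(s.charAt(from))
                && DELIMS.indexOf(s.charAt(from)) < 0) from++;
        return from;
    }

    private static int skipLiteralString(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') i++;                    // skip the escaped character
            else if (c == '(') depth++;
            else if (c == ')' && --depth == 0) return i + 1;
        }
        return s.length();
    }

    private static int skipInlineImage(String s, int from) {
        for (int p = s.indexOf("EI", from); p >= 0; p = s.indexOf("EI", p + 1)) {
            boolean before = p > 0 && Character.isWhitespace(s.charAt(p - 1));
            boolean after = p + 2 >= s.length() || Character.isWhitespace(s.charAt(p + 2));
            if (before && after) return p + 2;
        }
        return s.length();
    }
}
```

Two caveats follow from the PDF model itself. A Do operator may invoke either an image or a form XObject (which can contain more paths and text), so it has to be resolved against the page resources. And a painted path says nothing about whether it is an illustration or just an underline, which is the ambiguity described above.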

Tilman
