
Posted to users@pdfbox.apache.org by Peter Murray-Rust <pm...@cam.ac.uk> on 2012/05/07 11:03:25 UTC

Re: Extracting vector graphics from PDF

I have followed Andrey's recommendation and am able to analyze graphics
primitives written to MyGraphics2D object (I generally use Batik's
SVGGraphics2D as I have an SVG toolkit for analyzing the primitives). I am
having difficulty with the Font information.

On Mon, Apr 2, 2012 at 2:51 PM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Peter, you have to pass your own Graphics2D object (with some overridden
> methods) to pdfbox.
>
>
> class MyGraphics2D extends Graphics2D {
>
>     public void setFont(Font font) { ... }
> }

I am using pdfbox-1.6.0 from Maven and using my own version of PDFReader
which captures the graphics. When I capture the font as above it is always:

java.awt.Font[family=null,name=null,style=plain,size=1]

When using PDFReader, however, all the glyphs are correctly drawn,
including italic and a variety of fonts. I have also managed to capture
graphic information where all the characters were outlines (SVGPath) rather
than SVGText suggesting that PDFReader had written the glyphs directly.

I'd be grateful for any pointers as to how I can capture either or both of
the Font and glyph information, and what is actually happening when the
information is passed. (I am quite prepared to work with the glyphs, as
there are some documents where, I think, only glyph information is provided,
so I have to do some analysis there.)

Peter

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

Thanks Andrey,

Here are the SVG files from the two versions (I apologize for the verbosity,
but people may want to inspect the paths):

1.7.0 paths only (glyphs)
<svg fill-opacity="1" xmlns:xlink="http://www.w3.org/1999/xlink"
color-rendering="auto" color-interpolation="auto" stroke="black"
text-rendering="auto" stroke-linecap="square" stroke-miterlimit="10"
stroke-opacity="1" shape-rendering="auto" fill="black"
stroke-dasharray="none" font-weight="normal" stroke-width="1" xmlns="
http://www.w3.org/2000/svg" font-family="&apos;Dialog&apos;"
font-style="normal" stroke-linejoin="miter" font-size="12"
stroke-dashoffset="0" image-rendering="auto">
  <!--Generated by the Batik Graphics2D SVG Generator-->
  <defs id="genericDefs" />
  <g>
    <defs id="defs1">
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath1">
        <path d="M0 0 L60.9419 0 L60.9419 81.2217 L0 81.2217 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath2">
        <path d="M0 0 L57.9843 0 L57.9843 77.2799 L0 77.2799 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath3">
        <path d="M0 0 L64.9174 0 L64.9174 86.5201 L0 86.5201 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath4">
        <path d="M0 0 L81.2564 0 L81.2564 121.8482 L0 121.8482 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath5">
        <path d="M0 0 L74.6531 0 L74.6531 99.4956 L0 99.4956 L0 0 Z" />
      </clipPath>
    </defs>
    <g text-rendering="optimizeLegibility"
transform="matrix(9.7634,0,0,9.7634,0,0)">
      <path d="M6.2598 9.3956 L6.2598 9.6299 C6.2598 9.7549 6.2598 9.8956
6.2598 9.9424 L6.3379 9.9737 L6.3379 9.9893 C6.2911 9.9893 6.2598 9.9893
6.2129 9.9893 C6.1661 9.9893 6.1348 9.9893 6.0879 9.9893 L6.0879 9.9737
L6.1661 9.9424 C6.1661 9.8643 6.1661 9.8018 6.1661 9.7393 L6.1661 9.7393
C6.1348 9.7706 6.0879 9.7862 6.0567 9.7862 C5.9473 9.7862 5.8379 9.6924
5.8379 9.5518 C5.8379 9.4268 5.9629 9.3174 6.0879 9.3174 C6.1192 9.3174
6.1504 9.3331 6.1817 9.3643 C6.1973 9.3331 6.2598 9.3174 6.2911 9.3174
L6.2911 9.3174 C6.2754 9.3487 6.2598 9.3799 6.2598 9.3956 ZM6.1661 9.7081
L6.1661 9.3956 C6.1661 9.3799 6.1348 9.3643 6.1036 9.3643 C6.0098 9.3643
5.9317 9.4424 5.9317 9.5518 C5.9317 9.6612 6.0098 9.7237 6.0879 9.7237
C6.1192 9.7237 6.1504 9.7237 6.1661 9.7081 Z" clip-path="url(#clipPath1)"
stroke="none" />
      <path d="M6.455 9.3799 L6.4081 9.3643 L6.4081 9.3487 C6.4394 9.3331
6.5175 9.3331 6.5488 9.3331 C6.5488 9.3643 6.5488 9.4424 6.5488 9.6143
C6.5488 9.6612 6.5488 9.6768 6.5644 9.6924 C6.58 9.7081 6.5956 9.7237
6.6269 9.7237 C6.6738 9.7237 6.7206 9.6924 6.7363 9.6612 L6.7363 9.3956
C6.7363 9.3799 6.7206 9.3799 6.6738 9.3643 L6.6894 9.3331 C6.7206 9.3331
6.7988 9.3174 6.83 9.3331 C6.83 9.3799 6.83 9.4424 6.83 9.6924 C6.8456
9.7237 6.8769 9.7393 6.9081 9.7549 L6.8925 9.7706 C6.8925 9.7706 6.8613
9.7862 6.8456 9.7862 C6.7988 9.7862 6.7519 9.7706 6.7519 9.7081 L6.7363
9.7081 C6.705 9.7549 6.6581 9.7862 6.5956 9.7862 C6.5644 9.7862 6.5175
9.7706 6.5019 9.7393 C6.4706 9.7237 6.455 9.6768 6.455 9.6299 C6.455 9.5518
6.455 9.4581 6.455 9.3799 Z" clip-path="url(#clipPath1)" stroke="none" />

and 1.6.0
<svg fill-opacity="1" xmlns:xlink="http://www.w3.org/1999/xlink"
color-rendering="auto" color-interpolation="auto" stroke="black"
text-rendering="auto" stroke-linecap="square" stroke-miterlimit="10"
stroke-opacity="1" shape-rendering="auto" fill="black"
stroke-dasharray="none" font-weight="normal" stroke-width="1" xmlns="
http://www.w3.org/2000/svg" font-family="&apos;Dialog&apos;"
font-style="normal" stroke-linejoin="miter" font-size="12"
stroke-dashoffset="0" image-rendering="auto">
  <!--Generated by the Batik Graphics2D SVG Generator-->
  <defs id="genericDefs" />
  <g>
    <defs id="defs1">
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath1">
        <path d="M0 0 L60.9419 0 L60.9419 81.2217 L0 81.2217 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath2">
        <path d="M0 0 L57.9843 0 L57.9843 77.2799 L0 77.2799 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath3">
        <path d="M0 0 L64.9174 0 L64.9174 86.5201 L0 86.5201 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath4">
        <path d="M0 0 L81.2564 0 L81.2564 121.8482 L0 121.8482 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath5">
        <path d="M0 0 L74.6531 0 L74.6531 99.4956 L0 99.4956 L0 0 Z" />
      </clipPath>
    </defs>
    <g text-rendering="optimizeLegibility" font-size="1"
font-family="&apos;null&apos;" transform="matrix(9.7634,0,0,9.7634,0,0)">
      <text xml:space="preserve" x="5.8067" y="9.7706"
clip-path="url(#clipPath1)" stroke="none">q</text>
      <text xml:space="preserve" x="6.3769" y="9.7706"
clip-path="url(#clipPath1)" stroke="none">u</text>
(this is part of the string "question")

I have verified that the glyphs correspond to "q" and "u". There is a
useful heuristic in that the clipPaths appear to be coupled to the fonts
(including, I think, different font sizes), so it effectively records the
fonts, their glyphs and their metrics. I am assuming that, if I knew how, I
could dump the font information (presumably through the COSDictionary).
That would give me most of what I need:
* the character (from 1.6.0)
* the character position (from 1.6.0)
* the glyph (from 1.7.0), giving (i) the coordinate origin, (ii) the width
and height and (iii) an indication of italic (neither 1.7.0 nor 1.6.0
decodes the glyph as italic, so I will have to use heuristics)

This is very tedious, but at least it's possible. However I would suggest
to the PDFBox developers that they preserve the character info when
transmitting to the drawing surface Graphics2D. This would allow different
fonts, even if not as beautiful.

On Tue, May 8, 2012 at 5:44 AM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Hi Peter,
>
> >When I use 1.7.0 NO text is written. Instead the characters are replaced
> by outline glyphs using <svg:path>. The visual layout is effectively the
> same as the input PDF but there are no explicit characters.
>
> Wow! They managed to implement it like Adobe suggested!
>
Good. We understand each other I think!


> >I guess that in 1.7.0 NO characters are transmitted to drawString and
> that everything is drawShape(), with the precomputed glyphs.
>
> Yes, you have to trace from where g.draw(Shape) / g.fill(Shape) is coming.
>
> Not an easy task, however, since paths are also drawn with the same methods.
>

That's my problem. If I can get in and add simple attributes to carry text
info through to the surface that would be great. As it is it is very
difficult (not impossible) to hack the glyph stream. We have to assume that
text mainly occurs in rows of glyphs with the same y-coordinate, but this
varies slightly because of the different glyph origins.
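
A rough sketch of that row assumption, in case it helps anyone following the thread. It groups glyph bounding boxes into rows by their bottom edge with a tolerance. The class name `GlyphRows`, the method `groupIntoRows` and the tolerance value are made up for illustration, and the boxes would have to come from the captured shapes (e.g. `Shape.getBounds2D()`). Descenders such as "q" sit lower than "u", so the tolerance has to be generous relative to the glyph height.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphRows {
    /**
     * Groups glyph bounding boxes into rows. A glyph joins the current row
     * when its bottom edge lies within tol of the smallest bottom edge in
     * that row; otherwise it starts a new row.
     */
    static List<List<Rectangle2D>> groupIntoRows(List<Rectangle2D> glyphs, double tol) {
        List<Rectangle2D> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMaxY));

        List<List<Rectangle2D>> rows = new ArrayList<>();
        List<Rectangle2D> current = null;
        double rowBottom = 0;
        for (Rectangle2D glyph : sorted) {
            if (current == null || glyph.getMaxY() - rowBottom > tol) {
                current = new ArrayList<>();   // start a new row
                rows.add(current);
                rowBottom = glyph.getMaxY();
            }
            current.add(glyph);
        }
        // Within each row, order glyphs left to right.
        for (List<Rectangle2D> row : rows) {
            row.sort(Comparator.comparingDouble(Rectangle2D::getMinX));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Rectangle2D> glyphs = new ArrayList<>();
        // Approximate boxes for the "q" and "u" outlines, plus one made-up glyph on a lower row.
        glyphs.add(new Rectangle2D.Double(5.8379, 9.3174, 0.5, 0.6719));
        glyphs.add(new Rectangle2D.Double(6.4081, 9.3331, 0.5, 0.4531));
        glyphs.add(new Rectangle2D.Double(5.8, 10.5, 0.5, 0.45));
        System.out.println(groupIntoRows(glyphs, 0.3).size() + " rows");
    }
}
```

Superscripts and subscripts would probably need a second pass, since this only looks at one edge of each box.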

> Maybe I'll download the new version and look deeper…
>

I, for one, would be grateful if you did! I thought I was
miscompiling/omitting some resource, etc. which caused different output. It
took a while to realise it was the version.

Having used (by mistake) a 0.7.0 version I have seen the marked progress
and want to thank and congratulate the PDFBox community for the current
position and momentum.

[Why don't I ask the authors for XML and SVG directly? That's a different
and political issue. If anyone is interested in helping liberate the
scientific literature legally then hacking PDFs is a major strategy.
Volunteers welcome!]

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Andrey Kuznetsov <im...@gmx.de>.

Hi Peter,

>When I use 1.7.0 NO text is written. Instead the characters are replaced by
outline glyphs using <svg:path>. The visual layout is effectively the same
as the input PDF but there are no explicit characters.

Wow! They managed to implement it like Adobe suggested!

>I guess that in 1.7.0 NO characters are transmitted to drawString and that
everything is drawShape(), with the precomputed glyphs.

Yes, you have to trace from where g.draw(Shape) / g.fill(Shape) is coming.

Not an easy task, however, since paths are also drawn with the same methods.
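
One way to tell the two apart might be to look at the call stack inside the overridden draw/fill method. The sketch below is only an illustration: `CallOrigin`, `calledFrom` and `FakeFontCaller` are made-up names, and the idea that glyph outlines in 1.7.0 arrive via classes under `org.apache.pdfbox.pdmodel.font` is an assumption (that package shows up in the 1.6.0 stack trace posted elsewhere in this thread, nothing more).

```java
class CallOrigin {
    /**
     * Returns true if any caller frame (excluding this method itself)
     * belongs to a class whose name starts with the given prefix.
     */
    static boolean calledFrom(String classNamePrefix) {
        StackTraceElement[] frames = new Throwable().getStackTrace();
        for (int i = 1; i < frames.length; i++) {   // frame 0 is calledFrom itself
            if (frames[i].getClassName().startsWith(classNamePrefix)) {
                return true;
            }
        }
        return false;
    }

    /** Hypothetical stand-in for a font class that ends up drawing a glyph outline. */
    static class FakeFontCaller {
        static boolean drawGlyph() {
            return calledFrom(FakeFontCaller.class.getName());
        }
    }

    public static void main(String[] args) {
        System.out.println("via font class: " + FakeFontCaller.drawGlyph());
        System.out.println("direct call:    " + calledFrom(FakeFontCaller.class.getName()));
    }
}
```

Inside a custom `fill(Shape)` one could then test something like `CallOrigin.calledFrom("org.apache.pdfbox.pdmodel.font.")` and tag the shape as a glyph when it matches. Walking the stack on every draw call is slow, so this is more of a diagnostic than a production approach.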

Maybe I'll download the new version and look deeper.

Andrey


Re: Extracting vector graphics from PDF

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

Thanks

On Mon, May 7, 2012 at 11:35 PM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Peter,
>
> the parser is like a platypus - it does not do much - it just parses the
> CFF font and creates some CFF-specific objects.
>
> As I already said, this is only half of the work - I have to implement a
> Type1 font writer to make that work.
>
OK

> Regarding the hack, I think that PdfBox already has it.
>
> You may get encoding and font metrics from it.
>
I think that's probably true.

> What I really don't understand is: what exactly is not working?
>

The AWT commands are being captured by SVGGraphics2D (from Batik, but also
pre-intercepted by a shell from me to capture any font stuff)

I have now run this twice, once with pdfbox-1.6.0 (from maven) and once
with pdfbox-1.7.0-SNAPSHOT.

The 1.6.0 captures the characters (e.g. "T" "h" "e") and their coordinates
as <svg:text>. When I display them the page is laid out exactly, except for
the font, which is some default.

When I use 1.7.0 NO text is written. Instead the characters are replaced by
outline glyphs using <svg:path>. The visual layout is effectively the same
as the input PDF but there are no explicit characters.

> If Font is working on "normal" Graphics it should also work on your
> "hacked" graphics.
>
> So what is your problem???
>

My guess is that in 1.6.0 the characters are transmitted to the
g.drawString() command without the Font having been transmitted. That would
result in readable text without the correct font. Ideally I need the font
for the metrics.

I guess that in 1.7.0 NO characters are transmitted to drawString and that
everything is drawShape(), with the precomputed glyphs. However the system
should know the characters at that stage as they are known to the 1.6.0
system! If I knew how to get them I could combine the information and that
would be fine as I could then create the glyph table.

I have tried another publisher - apart from the first 2 are these fonts any
better?
COSDictionary{(COSName{BaseFont}:COSName{Arial-BoldMT})
COSDictionary{(COSName{BaseFont}:COSName{ArialMT})
COSDictionary{(COSName{BaseFont}:COSName{FEDNDC+AdvOTbdfd27ae.B})
COSDictionary{(COSName{BaseFont}:COSName{FEDNED+AdvOTb65e897d.B})
COSDictionary{(COSName{BaseFont}:COSName{FEDNEE+AdvOT1ef757c0})
COSDictionary{(COSName{BaseFont}:COSName{FEDNFF+AdvP7DB7})
COSDictionary{(COSName{BaseFont}:COSName{FEDNGG+AdvOT7d6df7ab.I})
COSDictionary{(COSName{BaseFont}:COSName{FEDNHG+AdvP414BFB})
COSDictionary{(COSName{BaseFont}:COSName{FEDNII+AdvOTc8fb9ce9})
COSDictionary{(COSName{BaseFont}:COSName{FEDOMF+AdvP4DD222})
COSDictionary{(COSName{BaseFont}:COSName{FEDPLG+AdvOT6f8dc4dc.I})
COSDictionary{(COSName{BaseFont}:COSName{FEEAKB+AdvP3EAA99})
COSDictionary{(COSName{BaseFont}:COSName{FEEALC+AdvP44E6F4})
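
For anyone wanting to work with a dump like this programmatically, here is a small sketch that pulls the BaseFont names out of the text. It only relies on the `COSName{BaseFont}:COSName{...}` shape visible in the lines above; the class and method names (`BaseFontDump`, `baseFontNames`) are invented, and reading the value from the dictionary object directly would be more robust than parsing toString() output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseFontDump {
    // Matches the "COSName{BaseFont}:COSName{...}" fragment seen in the dump.
    private static final Pattern BASE_FONT =
            Pattern.compile("COSName\\{BaseFont\\}:COSName\\{([^}]*)\\}");

    /** Returns every BaseFont value found in the dump text, in order of appearance. */
    static List<String> baseFontNames(String dump) {
        List<String> names = new ArrayList<>();
        Matcher m = BASE_FONT.matcher(dump);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String dump = "COSDictionary{(COSName{BaseFont}:COSName{ArialMT})\n"
                + "COSDictionary{(COSName{BaseFont}:COSName{FEDNFF+AdvP7DB7})";
        System.out.println(baseFontNames(dump));
    }
}
```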

I assume that the PDF has transmitted


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Andrey Kuznetsov <im...@gmx.de>.

Peter,

the parser is like a platypus - it does not do much - it just parses the CFF
font and creates some CFF-specific objects.

As I already said, this is only half of the work - I have to implement a
Type1 font writer to make that work.

Regarding the hack, I think that PdfBox already has it.

You may get encoding and font metrics from it.

What I really don't understand is: what exactly is not working?

If Font is working on "normal" Graphics it should also work on your "hacked"
graphics.

So what is your problem???

Andrey


Re: Extracting vector graphics from PDF

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

On Mon, May 7, 2012 at 1:31 PM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Peter,
>
> The COS output is horribly formatted, so I read only the first line ;-)
>

Sorry - that is what COSDictionary.toString() gave.

> It uses a FontFile3 stream.
>
> A FontFile3 stream contains a font in either Compact Font Format (CFF) or
> OpenType Format (OTF),
>
> which are not supported by Java.
>
> The font name is "FKAJPF+AdvOT3b30f6db.B", which means that it is a subset
> of the font named "AdvOT3b30f6db.B".
>

I am ignorant about fonts so please correct any errors.

> I don't know exactly how PdfBox handles CFF/OpenType fonts; probably they
> just search for a surrogate font (by name) or some kind of default font
> (since I never saw such a horrible font name in system fonts).
>

I have no idea where the font came from. It's probably created by the
publisher or bought from a supplier.

> I don’t know if this is really useful for you.
>

It's very useful! First it explains why I had problems and gives me
confidence in the process.

> I also have no idea why font name/style are not set.
>
> It may nevertheless be a valid font.
>
> BTW The only way to make Java understand CFF/OTF fonts is to convert them
> to Type1 fonts.
>
> I doubt that there is any free Java program which could do it.
>

Thanks for the information.

> (I managed to write a parser for CFF fonts, but still have to dig into the
> Type1 font format; however, my to-do list is really long and the Type1
> format is not in first place ;-))
>

What does the parser do?

> Best Regards
>

I shall probably create a hack of some kind. I can find a sans-serif and
serif which are "fairly close" and substitute them. How would I get a
system COSDictionary I could substitute?

I am mainly interested in:
* the identity of the characters
* the font metrics of  the characters.

In this way I can guess the words and the spaces between them.
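
To make the word-guessing idea concrete, here is a minimal sketch. It assumes each captured character comes with an x origin and an advance width in the same units, and starts a new word whenever the gap to the previous character exceeds a threshold. `WordGuesser`, `Glyph` and the threshold are hypothetical; in the example only the x values for "q" and "u" resemble captured positions, the widths and the other glyphs are invented.

```java
import java.util.ArrayList;
import java.util.List;

class WordGuesser {
    /** One captured character: its text, x origin and advance width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits a left-to-right row of glyphs into words. A new word starts when
     * the gap between one glyph's right edge (x + width) and the next glyph's
     * x origin is larger than gapThreshold.
     */
    static List<String> words(List<Glyph> row, double gapThreshold) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : row) {
            if (prev != null && g.x - (prev.x + prev.width) > gapThreshold) {
                out.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Glyph> row = new ArrayList<>();
        row.add(new Glyph("q", 5.8067, 0.5702));
        row.add(new Glyph("u", 6.3769, 0.57));
        row.add(new Glyph("a", 7.4, 0.5));   // made-up glyph after a visible gap
        System.out.println(words(row, 0.1));
    }
}
```

Without real font metrics the widths would have to be estimated, for example from the glyph outline bounds, and the threshold would presumably need to scale with the font size.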

> Andrey
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Andrey Kuznetsov <im...@gmx.de>.

Peter,

The COS output is horribly formatted, so I read only the first line ;-)

It uses a FontFile3 stream.

A FontFile3 stream contains a font in either Compact Font Format (CFF) or
OpenType Format (OTF),

which are not supported by Java.

The font name is "FKAJPF+AdvOT3b30f6db.B", which means that it is a subset
of the font named "AdvOT3b30f6db.B".
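
The subset naming convention is easy to check in code: per the PDF specification a subset font name is a tag of six uppercase letters, a plus sign, and then the original font name. A small sketch (the class `SubsetFontName` and its methods are just illustrative names):

```java
class SubsetFontName {
    /** True if the name starts with a six-uppercase-letter subset tag and '+'. */
    static boolean isSubset(String baseFont) {
        return baseFont.matches("[A-Z]{6}\\+.+");
    }

    /** Strips the subset tag if present; otherwise returns the name unchanged. */
    static String withoutSubsetTag(String baseFont) {
        return isSubset(baseFont) ? baseFont.substring(7) : baseFont;
    }

    public static void main(String[] args) {
        System.out.println(withoutSubsetTag("FKAJPF+AdvOT3b30f6db.B"));
        System.out.println(withoutSubsetTag("ArialMT"));
    }
}
```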

I don't know exactly how PdfBox handles CFF/OpenType fonts; probably they
just search for a surrogate font (by name) or some kind of default font
(since I never saw such a horrible font name in system fonts).

I don't know if this is really useful for you.

I also have no idea why font name/style are not set.

It may nevertheless be a valid font.

BTW The only way to make Java understand CFF/OTF fonts is to convert them to
Type1 fonts.

I doubt that there is any free Java program which could do it.

(I managed to write a parser for CFF fonts, but still have to dig into the
Type1 font format; however, my to-do list is really long and the Type1 format
is not in first place ;-))

Best Regards

Andrey

From: peter.murray.rust@googlemail.com
[mailto:peter.murray.rust@googlemail.com] On behalf of Peter Murray-Rust
Sent: Monday, 7 May 2012 13:15
To: Andrey Kuznetsov
Subject: Re: Extracting vector graphics from PDF

 

Andrey,
I have added 
System.err.println(fontDictionary.toString());
to the constructor.

I attach my output (I assume that it is too large to post to the list).

In case it helps, it is the first 2 pages of:
http://www.biomedcentral.com/content/pdf/1471-2148-11-371.pdf (which has
no copyright restrictions)




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

On Mon, May 7, 2012 at 11:24 AM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Peter,
>
> super( fontDictionary );
>
> sets the field font in the superclass (PDFont).
>
> Don't worry about the "Not implemented…" thing; it's implemented in
> extending classes (e.g. PDTrueTypeFont).
>
> So really I need that dump of fontDictionary to tell you more.
>
> fontDictionary.toString() will do it.
>

I will do that! To avoid contamination I shall re-checkout pdfbox-1.6.0,
edit in the dump instruction and recompile/re-run.

Meanwhile I have rerun my code with pdfbox-1.7.0-SNAPSHOT and note that the
final SVG contains ONLY paths and no text.

many thanks..

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Andrey Kuznetsov <im...@gmx.de>.

Peter,

super( fontDictionary );

sets the field font in the superclass (PDFont).

Don't worry about the "Not implemented..." thing; it's implemented in
extending classes (e.g. PDTrueTypeFont).

So really I need that dump of fontDictionary to tell you more.

fontDictionary.toString() will do it.

Andrey


Re: Extracting vector graphics from PDF

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

On Mon, May 7, 2012 at 10:58 AM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Hi Peter,
>
> There must be a COSDictionary field called font (in PDSimpleFont).
>
> Can you dump it and post it here?
>

// You mean fontDictionary, I imagine?
    public PDSimpleFont( COSDictionary fontDictionary )
    {
        super( fontDictionary );
    }

    /**
     * Looks up, creates, returns  the AWT Font.
     */
    public Font getawtFont() throws IOException
    {
        log.error("Not yet implemented:" + getClass().getName() );
        return null;
    }

I notice now that this call is used in drawString so that might explain why
there is no font information
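
Since a null or empty Font may well reach the capturing Graphics2D, the body of an overridden setFont could record what arrives without falling over on null. The sketch below is only a guess at what such a recorder might look like; `FontLog` and `record` are made-up names, not part of PDFBox or Batik.

```java
import java.awt.Font;
import java.util.ArrayList;
import java.util.List;

class FontLog {
    private final List<String> entries = new ArrayList<>();

    /** Records one setFont call; a null font is logged explicitly rather than dropped. */
    void record(Font font) {
        if (font == null) {
            entries.add("null font");
        } else {
            entries.add(font.getName() + " style=" + font.getStyle()
                    + " size=" + font.getSize2D());
        }
    }

    /** Returns the recorded descriptions in call order. */
    List<String> entries() {
        return entries;
    }

    public static void main(String[] args) {
        FontLog log = new FontLog();
        log.record(null);
        System.out.println(log.entries());
    }
}
```

An overriding setFont(Font) would call something like `log.record(font)` before delegating to the superclass, so every call shows up in the log even when the font carries no usable name.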

Is it worth changing to 1.7.0??


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

AW: Extracting vector graphics from PDF

Posted by Andrey Kuznetsov <im...@gmx.de>.

Hi Peter,

 

There must be a COSDictionary field called font (in PDSimpleFont).

Can you dump it and post here?

 

Andrey

 

 

 

From: peter.murray.rust@googlemail.com
[mailto:peter.murray.rust@googlemail.com] On Behalf Of Peter Murray-Rust
Sent: Monday, 7 May 2012 11:21
To: Andrey Kuznetsov
Cc: users@pdfbox.apache.org
Subject: Re: Extracting vector graphics from PDF

 

 

On Mon, May 7, 2012 at 10:10 AM, Andrey Kuznetsov <im...@gmx.de> wrote:

Hi Peter,

 

did you try to trace where setFont() gets called from?

 

Best Regards

 

Andrey

 

Andrey - this is very helpful of you - I hope you have time to comment on
the stack trace:

PDFSVGGraphics2D is my extended SVGGraphics engine
PDF2SVGReader.writePage(PDF2SVGReader.java:115) is given below


    at org.xmlcml.graphics.pdf2svg.PDFSVGGraphics2D.setFont(PDFSVGGraphics2D.java:81)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.writeFont(PDSimpleFont.java:304)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:114)
    at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:194)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:494)
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:107)
    at org.xmlcml.graphics.pdf2svg.PDF2SVGReader.writePage(PDF2SVGReader.java:115)
    at org.xmlcml.graphics.pdf2svg.PDF2SVGReader.readPDFFile(PDF2SVGReader.java:99)
    at org.xmlcml.graphics.pdf2svg.PDF2SVGReader.main(PDF2SVGReader.java:210)
    at org.xmlcml.graphics.pdf.PDFReaderTest.testBMC(PDFReaderTest.java:16)

// writePage is hacked from PDFReader

    private void writePage(int pageNumber)
    {
        try
        {
            // PMR
            PageDrawer drawer = new PageDrawer();
            PageWrapper wrapper = new PageWrapper( this );
            PDPage page = (PDPage)pages.get(pageNumber);
            wrapper.displayPage( page );
            PDRectangle cropBox = page.findCropBox();
            Dimension drawDimension = cropBox.createDimension();
            svgGraphics2D = this.createSVGGraphics();
            drawer.drawPage( svgGraphics2D, page, drawDimension );
            writeSVG(pageNumber);
        }
        catch (IOException exception)
        {
            exception.printStackTrace();
        }
    }


 

 

 

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Extracting vector graphics from PDF

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

On Mon, May 7, 2012 at 10:10 AM, Andrey Kuznetsov <im...@gmx.de> wrote:

> Hi Peter,
>
> did you try to trace where setFont() gets called from?
>
> Best Regards
>
> Andrey
>
Andrey - this is very helpful of you - I hope you have time to comment on
the stack trace:

PDFSVGGraphics2D is my extended SVGGraphics engine
PDF2SVGReader.writePage(PDF2SVGReader.java:115) is given below


    at org.xmlcml.graphics.pdf2svg.PDFSVGGraphics2D.setFont(PDFSVGGraphics2D.java:81)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.writeFont(PDSimpleFont.java:304)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:114)
    at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:194)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:494)
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:107)
    at org.xmlcml.graphics.pdf2svg.PDF2SVGReader.writePage(PDF2SVGReader.java:115)
    at org.xmlcml.graphics.pdf2svg.PDF2SVGReader.readPDFFile(PDF2SVGReader.java:99)
    at org.xmlcml.graphics.pdf2svg.PDF2SVGReader.main(PDF2SVGReader.java:210)
    at org.xmlcml.graphics.pdf.PDFReaderTest.testBMC(PDFReaderTest.java:16)

// writePage is hacked from PDFReader

    private void writePage(int pageNumber)
    {
        try
        {
            // PMR
            PageDrawer drawer = new PageDrawer();
            PageWrapper wrapper = new PageWrapper( this );
            PDPage page = (PDPage)pages.get(pageNumber);
            wrapper.displayPage( page );
            PDRectangle cropBox = page.findCropBox();
            Dimension drawDimension = cropBox.createDimension();
            svgGraphics2D = this.createSVGGraphics();
            drawer.drawPage( svgGraphics2D, page, drawDimension );
            writeSVG(pageNumber);
        }
        catch (IOException exception)
        {
            exception.printStackTrace();
        }
    }




--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

AW: Extracting vector graphics from PDF

Posted by Andrey Kuznetsov <im...@gmx.de>.

Hi Peter,

 

did you try to trace where setFont() gets called from?
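
One way to check this without a debugger is to record the caller from inside
the overridden setFont() itself. The following is only a rough sketch:
FontCallTracer and record() are hypothetical names (not part of PDFBox or
Batik), and it assumes the overriding setFont(Font) calls
tracer.record(String.valueOf(font)).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: remembers which method called setFont() and with what. */
public class FontCallTracer {
    private final List<String> calls = new ArrayList<String>();

    /**
     * Intended to be called from an overridden Graphics2D.setFont(Font),
     * passing a description of the font (for example String.valueOf(font)).
     */
    public void record(String fontDescription) {
        // Frame 0 is this method, frame 1 is the method that called record()
        // (the overriding setFont), frame 2 is whoever invoked setFont().
        StackTraceElement[] stack = new Throwable().getStackTrace();
        // Fall back to the deepest available frame if the stack is shorter.
        StackTraceElement caller = stack.length > 2 ? stack[2] : stack[stack.length - 1];
        calls.add(caller.getClassName() + "." + caller.getMethodName()
                + ":" + caller.getLineNumber() + " -> " + fontDescription);
    }

    /** Returns the recorded calls in the order they happened. */
    public List<String> getCalls() {
        return calls;
    }
}
```

Each entry in getCalls() then pairs the calling method and line number with
the font description it passed, which should show whether the empty font
always arrives from the same place.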

 

Best Regards

 

Andrey

 

 

 

From: peter.murray.rust@googlemail.com
[mailto:peter.murray.rust@googlemail.com] On Behalf Of Peter Murray-Rust
Sent: Monday, 7 May 2012 11:03
To: Andrey Kuznetsov
Cc: users@pdfbox.apache.org
Subject: Re: Extracting vector graphics from PDF

 

I have followed Andrey's recommendation and am able to analyze graphics
primitives written to MyGraphics2D object (I generally use Batik's
SVGGraphics2D as I have an SVG toolkit for analyzing the primitives). I am
having difficulty with the Font information.


On Mon, Apr 2, 2012 at 2:51 PM, Andrey Kuznetsov <im...@gmx.de> wrote:

Peter, you have to pass your own Graphics2D object (with some overridden
methods) to pdfbox.


MyGraphics2D extends Graphics2D {

    public void setFont(Font) { ...}

I am using pdfbox-1.6.0 from Maven and using my own version of PDFReader
which captures the graphics. When I capture the font as above it is always:

java.awt.Font[family=null,name=null,style=plain,size=1

When using PDFReader, however, all the glyphs are correctly drawn, including
italic and a variety of fonts. I have also managed to capture graphic
information where all the characters were outlines (SVGPath) rather than
SVGText suggesting that PDFReader had written the glyphs directly. 

I'd be grateful for any pointers as to how I can capture the Font
information, the glyph information, or both, and what is actually happening
when the information is passed. (I am quite prepared to work with the glyphs
as there are some documents where, I think, only glyph information is
provided, so I have to do some analysis there.)

Peter
 

-- 

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069