
Posted to user@tika.apache.org by Andisa Dewi <th...@yahoo.com> on 2017/02/27 10:32:29 UTC

Extracting vector graphics from pdf

Hello guys,

I'm currently extracting images from a whole lot of PDF files; however, 
some of the images (or figures) are not extracted. I think it might be 
because those images are vector graphics (as is usually the case in a lot 
of scientific papers). My question is: is it possible to extract vector 
graphics from PDFs using Tika?

I attached an example PDF (in it, all images are extracted except 
Figure 2).

The way I'm extracting the images is the same as in the example code:

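import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

Parser parser = new AutoDetectParser();
Metadata m = new Metadata();
ParseContext c = new ParseContext();
// -1 disables the handler's write limit
ContentHandler h = new BodyContentHandler(-1);
PDFParserConfig pdfConfig = new PDFParserConfig();
// inline image extraction is off by default
pdfConfig.setExtractInlineImages(true);
c.set(PDFParserConfig.class, pdfConfig);
c.set(Parser.class, parser);
EmbeddedDocumentExtractor ex = new MyEmbeddedDocumentExtractor(c);
c.set(EmbeddedDocumentExtractor.class, ex);
parser.parse(inputstream, h, m, c);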


Thanks!

Regards,

Eli

Re: Extracting vector graphics from pdf

Posted by Manuel Aristarán <ma...@jazzido.com>.

> On Feb 27, 2017, at 7:20 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
> I'm currently extracting images from a whole lot of PDF files; however, some of the images (or figures) are not extracted. I think it might be because those images are vector graphics (as is usually the case in a lot of scientific papers). My question is: is it possible to extract vector graphics from PDFs using Tika?


We do some of that in Tabula: https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/ObjectExtractor.java#L181-L270



—
Manuel Aristarán <ma...@jazzido.com>
http://jazzido.com

Re: Extracting vector graphics from pdf

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

We (ContentMine.org) also extract vectors from PDFs using PDFBox (at present
1.8, but we shall upgrade). Our emphasis is on diagrams, including plots,
graphs, charts, and some specialised areas such as chemistry.

The code is at https://github.com/ContentMine/pdf2svg and
https://github.com/ContentMine/svg2xml (it used to be on Bitbucket/hg).
It is under daily revision.
This is a two-step process. First, extract all the characters and vectors and
normalize coordinates and other features. Second, build more complex
primitives such as polygons and circles, and then build up to textboxes,
phrases, subscripts, arrows and other higher-level objects. These are then
stitched together into plots, chemical structures, etc. Everything is open
source under the Apache 2 license.
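
To make the "build more complex primitives" step concrete, here is a minimal sketch of one way line segments could be chained into polylines and closed polygons. It is not taken from pdf2svg or svg2xml; the class name SegmentChainer, the Polyline type and the tolerance parameter are all hypothetical, and it assumes the segments arrive in drawing order as java.awt.geom.Line2D objects.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: chains line segments (in drawing order) into polylines. */
class SegmentChainer {

    /** A run of connected points; closed means the last point meets the first. */
    static class Polyline {
        final List<Point2D> points = new ArrayList<>();
        boolean closed;
    }

    /**
     * Greedy chaining: a segment extends the current polyline when its start
     * lies within tol of the polyline's last point; otherwise it starts a new one.
     */
    static List<Polyline> chain(List<Line2D> segments, double tol) {
        List<Polyline> result = new ArrayList<>();
        Polyline current = null;
        for (Line2D s : segments) {
            Point2D last = current == null ? null
                    : current.points.get(current.points.size() - 1);
            if (last == null || last.distance(s.getP1()) > tol) {
                current = new Polyline();
                current.points.add(s.getP1());
                result.add(current);
            }
            current.points.add(s.getP2());
        }
        for (Polyline p : result) {
            Point2D first = p.points.get(0);
            Point2D end = p.points.get(p.points.size() - 1);
            // At least three segments (four points) are needed to enclose an area.
            p.closed = p.points.size() >= 4 && first.distance(end) <= tol;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Line2D> segs = new ArrayList<>();
        // A triangle followed by one stray line.
        segs.add(new Line2D.Double(0, 0, 10, 0));
        segs.add(new Line2D.Double(10, 0, 5, 8));
        segs.add(new Line2D.Double(5, 8, 0, 0));
        segs.add(new Line2D.Double(20, 20, 30, 20));
        for (Polyline p : chain(segs, 0.01)) {
            System.out.println(p.points.size() + " points, closed=" + p.closed);
        }
    }
}
```

A real pipeline would also need to handle curves, segments drawn out of order, and shared vertices; the greedy pass above only illustrates the idea of promoting raw segments to a higher-level primitive.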

I have been using PDFBox for about 7 years now and would also like to give
wholehearted thanks to the team. When it started there were 1-2 mails /
week and now you have built a vibrant community including a StackOverflow
resource.

Our goal is to read the whole scientific literature and make it free and
we'd welcome others who are also interested in that.

Peter

On Tue, Feb 28, 2017 at 1:32 PM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> Thank you, Tilman!
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Monday, February 27, 2017 9:38 AM
> To: users@pdfbox.apache.org
> Cc: user@tika.apache.org
> Subject: Re: Extracting vector graphics from pdf
>
> http://stackoverflow.com/a/38933039/535646
>
> This lets you collect the lines. However, it won't output an image.
>
> Tilman

-- 
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

RE: Extracting vector graphics from pdf

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Thank you, Tilman!

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Monday, February 27, 2017 9:38 AM
To: users@pdfbox.apache.org
Cc: user@tika.apache.org
Subject: Re: Extracting vector graphics from pdf

http://stackoverflow.com/a/38933039/535646

This lets you collect the lines. However, it won't output an image.

Tilman


Re: Extracting vector graphics from pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

http://stackoverflow.com/a/38933039/535646

This lets you collect the lines. However, it won't output an image.
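
Once the lines have been collected, one possible way to get an image out of them is to write them to SVG. The sketch below is not the code from the linked answer; it assumes the segments are already available as java.awt.geom.Line2D objects in PDF user space, and the class name LinesToSvg and its parameters are hypothetical. It ignores curves, fills, colours and stroke widths.

```java
import java.awt.geom.Line2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: serialises collected line segments as a minimal SVG document. */
class LinesToSvg {

    /**
     * @param lines      segments in PDF user space (origin bottom-left, y pointing up)
     * @param pageWidth  page width in PDF units
     * @param pageHeight page height in PDF units, used to flip the y axis
     */
    static String toSvg(List<Line2D> lines, double pageWidth, double pageHeight) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT,
                "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"%.2f\" height=\"%.2f\">%n",
                pageWidth, pageHeight));
        for (Line2D l : lines) {
            // SVG's y axis points down, so mirror each y coordinate.
            sb.append(String.format(Locale.ROOT,
                    "  <line x1=\"%.2f\" y1=\"%.2f\" x2=\"%.2f\" y2=\"%.2f\" stroke=\"black\"/>%n",
                    l.getX1(), pageHeight - l.getY1(),
                    l.getX2(), pageHeight - l.getY2()));
        }
        sb.append("</svg>");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Line2D> lines = new ArrayList<>();
        lines.add(new Line2D.Double(0, 0, 100, 50));
        System.out.println(toSvg(lines, 300, 200));
    }
}
```

The resulting string can be saved as a .svg file and opened in a browser, or rasterised with any SVG renderer if a bitmap is needed.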

Tilman

On 27.02.2017 at 13:20, Allison, Timothy B. wrote:
> PDFBox Colleagues,
>    Any recommendations?
>
>            Best,
>
>                   Tim

RE: Extracting vector graphics from pdf

Posted by "Allison, Timothy B." <ta...@mitre.org>.

PDFBox Colleagues,
  Any recommendations?

          Best,

                 Tim

