Posted to users@pdfbox.apache.org by Stefan Falk <s....@student.tugraz.at> on 2015/01/14 21:42:50 UTC

Extract underlying PDF code from PDF file by selecting an area

Hello pdfbox people!

I was wondering if anybody could help me. What I am looking for is a way
to extract the underlying PDF code from a PDF file by simply selecting an
area with the mouse.

After reading a bit about PDFs I have learned that extracting anything
from a PDF can be quite a hard task.

So I was wondering whether pdfbox could do that somehow. I've had a rough
look at the PDFReader and noticed that there is, e.g.,
processTextPosition in the class PageDrawer, which seems to let me get at
least the position of text - am I right in assuming that?

My concrete question is: what is possible with pdfbox in this regard?
For example, I have a PDF on my drive whose text seems to be "extractable"
by pdfbox, yet the PDFReader is not able to render any of it. It just
renders the images (see attachment).

Thank you for your help in advance!

Best regards,
Stefan

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by John Hewson <jo...@jahewson.com>.

Yes, PDFBox can do this.

-- John

> On 14 Jan 2015, at 23:48, Stefan Falk <s....@student.tugraz.at> wrote:
> 
> Hi John!
> 
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox be the best choice for this? I have looked hard for a library, but there do not seem to be many open source tools out there.
> 
> My goal is a program that lets the user clip PDFs and compose a new PDF out of all the clips; maybe you could tell me whether pdfbox would be the best choice for such a task.
> 
> @fairly difficult: Well yes, I was quite astonished to find out that extracting content from a PDF is actually a scientific topic :D
> 
> Best regards,
> Stefan

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

You're welcome - and yes, we are always interested in getting our hands on files that cannot be rendered correctly.
Please also try opening them in Adobe Reader/Acrobat, just to get an idea of how they are processed there. Sometimes we get PDFs that are so corrupted that there is not a lot we can do about them.

The users mailing list is fine for all usage questions. If you are sure, or think, that you have found a bug, please open an issue at https://issues.apache.org/jira/browse/PDFBOX with a test case that reproduces the issue and the PDF in question attached. If you have an idea how to overcome the issue, you can also attach a patch for us to review.

Good luck with your project and feel free to ask additional questions as they arise.

BR
Maruan


On 15.01.2015 at 09:18, Stefan Falk <s....@student.tugraz.at> wrote:

> This is awesome! Thank you!
> 
> I will take a close look at it and update to the trunk version too.
> 
> Do you want me to report PDFs that could not be displayed correctly in the future?
> 
> Best regards,
> Stefan

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by Stefan Falk <s....@student.tugraz.at>.

This is awesome! Thank you!

I will take a close look at it and update to the trunk version too.

Do you want me to report PDFs that could not be displayed correctly in 
the future?

Best regards,
Stefan

On 2015-01-15 09:03, Maruan Sahyoun wrote:
> Hi Stefan,
>
> Yes, PDFBox is capable of doing this. To crop the page to the dimensions you need, you can use
>
> PDPage.setCropBox [http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)]
>
> As John pointed out, the SuperimposePage example will give you the basics to import and 'mount' the page into a new or existing PDF.
>
> The only thing left is to get the coordinates from the mouse and translate them into the dimensions of the rectangle in PDF.
>
> BR
> Maruan

Re: Problems building the project

Posted by Tilman Hausherr <TH...@t-online.de>.

On 18.01.2015 at 20:33, Stefan Falk wrote:
> junit.framework.AssertionFailedError: JCE unlimited strength jurisdiction policy files are not installed

https://www.google.de/search?q=JCE+unlimited+strength+jurisdiction+policy+files

Re: Problems building the project

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

I can't say anything about the Eclipse problems (I use NetBeans), but
regarding strong encryption: could it be that you have several JDK
versions on your system?

Another possible cause (I'm not 100% sure of this) is where the policy
files were installed: in the JRE's lib/security directory or in the
JDK's jre/lib/security directory. It probably depends on which one is
actually used.

http://docs.oracle.com/cd/E19398-01/820-1228/agfik/index.html
"Where, <java-home> is the JRE directory within your Java Development 
Kit (JDK) environment, or the top-level directory of the JRE. "
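
To see which of the two locations applies, a small check like the
following may help. It is only a sketch: the class name PolicyDirs is
made up for illustration, and it assumes an older JRE layout in which
the policy jars (local_policy.jar, US_export_policy.jar) sit directly in
lib/security. Run it with the same Java installation that Maven uses.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the lib/security directories to inspect. */
public class PolicyDirs {

    /** Returns both candidate directories below the given java.home. */
    public static List<File> candidates(String javaHome) {
        List<File> dirs = new ArrayList<>();
        // java.home points at a JRE: <java-home>/lib/security
        dirs.add(new File(javaHome, "lib/security"));
        // java.home points at a JDK: <java-home>/jre/lib/security
        dirs.add(new File(javaHome, "jre/lib/security"));
        return dirs;
    }

    public static void main(String[] args) {
        String javaHome = System.getProperty("java.home");
        System.out.println("java.home = " + javaHome);
        for (File dir : candidates(javaHome)) {
            // Assumption: the policy files are named as in older JREs.
            boolean hasPolicy = new File(dir, "local_policy.jar").isFile();
            System.out.println(dir + " exists=" + dir.isDirectory()
                    + " local_policy.jar=" + hasPolicy);
        }
    }
}
```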

Tilman

On 18.01.2015 at 22:19, Stefan Falk wrote:
> Hm, I don't get it. The requested files are already present on my 
> system. And I can even run the JUnit tests in Eclipse for the 
> encryption package successfully without any fails.
>
> It only fails if I run "mvn clean install" manually.
>
> I am rather concerned about these errors I am getting in Eclipse
>
> > Error(s) found in manifest configuration 
> (org.apache.felix:maven-bundle-plugin:2.4.0:bundle:default-bundle:package) 
> pom.xml
>
> and
>
> > Plugin execution not covered by lifecycle configuration: 
> org.codehaus.mojo:javacc-maven-plugin:2.6:javacc (execution: javacc, 
> phase: generate-sources)
> > Plugin execution not covered by lifecycle configuration: 
> com.googlecode.maven-download-plugin:maven-download-plugin:1.1.0:wget 
> (execution: get-isartor, phase: generate-test-resources)
>
> any ideas why I get these errors?
>
> Best regards,
> Stefan

Re: Problems building the project

Posted by John Hewson <jo...@jahewson.com>.

On 18 Jan 2015, at 13:19, Stefan Falk <s....@student.tugraz.at> wrote:
> 
> Hm, I don't get it. The requested files are already present on my system. And I can even run the JUnit tests in Eclipse for the encryption package successfully without any fails.

It sounds like you have more than one JRE installed. Try running the org.apache.pdfbox.util.TestRendering test from Eclipse and see what it prints out; it should log your JDK and version, e.g.

JDK: Java(TM) SE Runtime Environment
Version: 1.8

The above version is what appears in your attached Maven log from the previous mail.
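
If running that test is inconvenient, a standalone check along the
following lines might serve the same purpose. This is a sketch, not part
of PDFBox: the class name JceCheck is hypothetical. It prints which
runtime is in use and asks the JRE for its maximum allowed AES key
length; a value of 128 suggests the limited policy is active, while a
larger value suggests the unlimited policy files are in effect.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

/** Hypothetical diagnostic: which JRE is running, and is strong crypto allowed? */
public class JceCheck {

    /** Maximum AES key length, in bits, permitted by the installed policy. */
    public static int maxAesKeyLength() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            // AES is a standard algorithm, so this should not normally happen.
            throw new IllegalStateException(e);
        }
    }

    /** True if keys longer than 128 bits are allowed. */
    public static boolean isUnlimited() {
        return maxAesKeyLength() > 128;
    }

    public static void main(String[] args) {
        System.out.println("JDK:       " + System.getProperty("java.runtime.name"));
        System.out.println("Version:   " + System.getProperty("java.version"));
        System.out.println("java.home: " + System.getProperty("java.home"));
        System.out.println("Max AES key length: " + maxAesKeyLength());
        System.out.println("Unlimited strength: " + isUnlimited());
    }
}
```

Running the class once from Eclipse and once from the shell in which
"mvn clean install" is called should show whether the two use different
Java installations.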

-- John

> It only fails if I run "mvn clean install" manually.
> 
> I am rather concerned about these errors I am getting in Eclipse
> 
> > Error(s) found in manifest configuration (org.apache.felix:maven-bundle-plugin:2.4.0:bundle:default-bundle:package) pom.xml

I’m not sure where this error is coming from. If it’s M2E, then it’s not our problem. If it’s Maven, then we might want to look at updating the felix plugin. However, that plugin is for creating OSGi bundles, so I doubt it’s the cause of your other build problem.

> and
> 
> > Plugin execution not covered by lifecycle configuration: org.codehaus.mojo:javacc-maven-plugin:2.6:javacc (execution: javacc, phase: generate-sources)
> > Plugin execution not covered by lifecycle configuration: com.googlecode.maven-download-plugin:maven-download-plugin:1.1.0:wget (execution: get-isartor, phase: generate-test-resources)
> 
> any ideas why I get these errors?

These errors are specific to M2E and its configuration and not related to the Maven build itself.

> Best regards,
> Stefan

Re: Problems building the project

Posted by Stefan Falk <s....@student.tugraz.at>.

Hm, I don't get it. The requested files are already present on my
system, and I can even run the JUnit tests for the encryption package in
Eclipse without any failures.

It only fails if I run "mvn clean install" manually.

I am rather concerned about these errors I am getting in Eclipse:

 > Error(s) found in manifest configuration 
(org.apache.felix:maven-bundle-plugin:2.4.0:bundle:default-bundle:package) 
pom.xml

and

 > Plugin execution not covered by lifecycle configuration: 
org.codehaus.mojo:javacc-maven-plugin:2.6:javacc (execution: javacc, 
phase: generate-sources)
 > Plugin execution not covered by lifecycle configuration: 
com.googlecode.maven-download-plugin:maven-download-plugin:1.1.0:wget 
(execution: get-isartor, phase: generate-test-resources)

Any ideas why I get these errors?

Best regards,
Stefan





On 2015-01-18 21:08, Maruan Sahyoun wrote:
> Hello Stefan,
>
> please find the dependencies listed here https://pdfbox.apache.org/2.0/dependencies.html. You're missing the "unlimited strength" cryptography
>
> BR
> Maruan

Re: Problems building the project

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hello Stefan,

The dependencies are listed here: https://pdfbox.apache.org/2.0/dependencies.html
You're missing the "unlimited strength" cryptography policy files.

BR
Maruan

On 18.01.2015 at 20:33, Stefan Falk <s....@student.tugraz.at> wrote:

> Hi!
> 
> I get quite a long list of errors when I check out the trunk as a Maven project using Eclipse Luna and the latest m2e Maven plugin (see screenshot).
> 
> I am not sure whether I am missing a plugin or using the wrong version of a plugin somewhere.
> 
> I've tried to build it manually by calling "mvn clean install", but this fails too (see maven.log).
> 
> Any help would be appreciated! Thank you!
> 
> Best regards,
> Stefan
> <maven.log>

Problems building the project

Posted by Stefan Falk <s....@student.tugraz.at>.

Hi!

I get quite a long list of errors when I check out the trunk as a Maven
project using Eclipse Luna and the latest m2e Maven plugin (see
screenshot).

I am not sure whether I am missing a plugin or using the wrong version
of a plugin somewhere.

I've tried to build it manually by calling "mvn clean install", but this
fails too (see maven.log).

Any help would be appreciated! Thank you!

Best regards,
Stefan

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Stefan,

Yes, PDFBox is capable of doing this. To crop the page to the dimensions you need, you can use

PDPage.setCropBox [http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)]

As John pointed out, the SuperimposePage example will give you the basics to import and 'mount' the page into a new or existing PDF.

The only thing left is to get the coordinates from the mouse and translate them into the dimensions of the rectangle in PDF.
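
The coordinate translation could look roughly like the sketch below. It
uses plain Java only, and everything in it is an assumption for
illustration: the class and method names are made up, the page is taken
to be unrotated with its lower-left corner at the PDF origin, and the
viewer is taken to render at a uniform scale (pixels per PDF unit). The
main point is that mouse coordinates grow downwards from the top-left
corner, while PDF coordinates grow upwards from the bottom-left corner.

```java
import java.util.Arrays;

/** Hypothetical helper: mouse selection (pixels) to PDF rectangle (points). */
public class SelectionToPdfRect {

    /**
     * Converts two drag corners given in pixels (origin top-left, y down)
     * into {lowerLeftX, lowerLeftY, upperRightX, upperRightY} in PDF units
     * (origin bottom-left, y up).
     *
     * scale is the number of pixels per PDF unit used for rendering,
     * pageHeightPt is the page height in PDF units.
     */
    public static float[] toPdfRect(int x1, int y1, int x2, int y2,
                                    float scale, float pageHeightPt) {
        // Normalise the corners so the drag direction does not matter.
        float left = Math.min(x1, x2) / scale;
        float right = Math.max(x1, x2) / scale;
        // Flip the y axis: the smallest pixel y is the top edge in PDF space.
        float top = pageHeightPt - Math.min(y1, y2) / scale;
        float bottom = pageHeightPt - Math.max(y1, y2) / scale;
        return new float[] { left, bottom, right, top };
    }

    public static void main(String[] args) {
        // Page 842 units high, rendered at 2 pixels per unit;
        // the user drags from (400, 300) to (100, 100).
        float[] rect = toPdfRect(400, 300, 100, 100, 2.0f, 842f);
        System.out.println(Arrays.toString(rect)); // [50.0, 692.0, 200.0, 792.0]
    }
}
```

The four resulting values are what would go into the PDRectangle that is
passed to PDPage.setCropBox.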

BR
Maruan

On 15.01.2015 at 08:48, Stefan Falk <s....@student.tugraz.at> wrote:

> Hi John!
> 
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox be the best choice for this? I have looked hard for a library, but there do not seem to be many open source tools out there.
> 
> My goal is a program that lets the user clip PDFs and compose a new PDF out of all the clips; maybe you could tell me whether pdfbox would be the best choice for such a task.
> 
> @fairly difficult: Well yes, I was quite astonished to find out that extracting content from a PDF is actually a scientific topic :D
> 
> Best regards,
> Stefan

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by Stefan Falk <s....@student.tugraz.at>.

Hi John!

Yes, clipping the PDF is basically what I would like to do! So would
pdfbox be the best choice for this? I have looked hard for a library, but
there do not seem to be many open source tools out there.

My goal is a program that lets the user clip PDFs and compose a new PDF
out of all the clips; maybe you could tell me whether pdfbox would be the
best choice for such a task.

@fairly difficult: Well yes, I was quite astonished to find out that 
extracting content from a PDF is actually a scientific topic :D

Best regards,
Stefan

On 2015-01-15 03:21, John Hewson wrote:
> Hi Stefan
>
> What you’re describing is actually fairly difficult due to the complexity of the PDF operators. We have a special processor for text in PDFBox, but it is not necessarily accurate.
>
> If you’re just trying to embed pages from existing PDFs into new PDFs, then the SuperimposePage example that comes with PDFBox might already serve your needs. If you specify a custom BBox for the FormXObject, you can use that to clip the page - which sounds like what you want. Please note that this technique still embeds all of the original page contents, so it’s not suitable for removing private or sensitive data, but otherwise it’s fine.
>
> If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk version of PDFBox, where we have fixed many bugs.
>
> Thanks
>
> -- John

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by John Hewson <jo...@jahewson.com>.

Hi Stefan

What you’re describing is actually fairly difficult due to the complexity of the PDF operators. We have a special processor for text in PDFBox, but it is not necessarily accurate.

If you’re just trying to embed pages from existing PDFs into new PDFs, then the SuperimposePage example that comes with PDFBox might already serve your needs. If you specify a custom BBox for the FormXObject, you can use that to clip the page - which sounds like what you want. Please note that this technique still embeds all of the original page contents, so it’s not suitable for removing private or sensitive data, but otherwise it’s fine.

If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk version of PDFBox, where we have fixed many bugs.

Thanks

-- John

> On 14 Jan 2015, at 15:14, Stefan Falk <s....@student.tugraz.at> wrote:
> 
> Well, basically I just want to extract it and load it into another PDF, but the selection should be possible with, e.g., the mouse.

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by Stefan Falk <s....@student.tugraz.at>.

Well, basically I just want to extract it and load it into another PDF,
but the selection should be possible with, e.g., the mouse.


On 2015-01-14 22:52, Maruan Sahyoun wrote:
> what would you like to do with that content?
>
> BR
> Maruan

Re: Extract underlying PDF code from PDF file by selecting an area

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

What would you like to do with that content?

BR
Maruan

On 14.01.2015 at 21:42, Stefan Falk <s....@student.tugraz.at> wrote:

> Hello pdfbox people!
> 
> I was wondering if anybody could help me. What I am looking for is a way to extract the underlying PDF code from a PDF file by simply selecting an area with the mouse.
> 
> After reading a bit about PDFs I have learned that extracting anything from a PDF can be quite a hard task.
> 
> So I was wondering whether pdfbox could do that somehow. I've had a rough look at the PDFReader and noticed that there is, e.g., processTextPosition in the class PageDrawer, which seems to let me get at least the position of text - am I right in assuming that?
> 
> My concrete question is: what is possible with pdfbox in this regard? For example, I have a PDF on my drive whose text seems to be "extractable" by pdfbox, yet the PDFReader is not able to render any of it. It just renders the images (see attachment).
> 
> Thank you for your help in advance!
> 
> Best regards,
> Stefan