Posted to users@pdfbox.apache.org by Stefan Falk <s....@student.tugraz.at> on 2015/01/14 21:42:50 UTC
Extract underlying PDF code from PDF file by selecting an area
Hello pdfbox people!
I was wondering if anybody can help me. What I am looking for is a way
to extract the underlying PDF code from a PDF file by simply selecting
an area with the mouse.
After reading a bit about PDFs, I have learned that extracting anything
from a PDF can be quite a hard task.
So I was wondering if pdfbox could do that somehow. I've taken a rough
look at the PDFReader and noticed that there is, e.g.,
processTextPosition in the class PageDrawer, which seems to let me get
at least the position of text - am I right in assuming that?
My concrete question is: what is possible with pdfbox in this regard?
E.g. I have a PDF on my drive whose text seems to be "extractable" by
pdfbox, yet the PDFReader is not able to render any of it. It just
renders the images (see attachment).
Thank you for your help in advance!
Best regards,
Stefan
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by John Hewson <jo...@jahewson.com>.
Yes, PDFBox can do this.
-- John
> On 14 Jan 2015, at 23:48, Stefan Falk <s....@student.tugraz.at> wrote:
>
> Hi John!
>
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox be the best choice for this? I have looked for a library for a long time, but there do not seem to be many open source tools out there.
>
> My goal is a program that lets the user clip PDFs and compose a new PDF out of all the clips; maybe you could tell me whether pdfbox would be the best choice for such a task.
>
> @fairly difficult: Well yes, I was quite astonished to find out that extracting content from a PDF is actually a scientific topic :D
>
> Best regards,
> Stefan
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
You're welcome - and yes, we are always interested in getting our hands on files that cannot be rendered correctly.
Please also try opening them with Adobe Reader/Acrobat, just to get an idea of how they are processed there. Sometimes we get PDFs that are so corrupted that there is not a lot we can do about them.
For all usage questions the users mailing list is fine. If you are sure, or think, that you have found a bug, please open an issue at https://issues.apache.org/jira/browse/PDFBOX with a test case to reproduce the issue and the PDF in question attached. If you have an idea how to overcome the issue, you can also attach a patch for us to review.
Good luck with your project and feel free to ask additional questions as they arise.
BR
Maruan
On 15.01.2015 at 09:18, Stefan Falk <s....@student.tugraz.at> wrote:
> This is awesome! Thank you!
>
> I will take a close look at it and update to the trunk version too.
>
> Do you want me to report PDFs that could not be displayed correctly in the future?
>
> Best regards,
> Stefan
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by Stefan Falk <s....@student.tugraz.at>.
This is awesome! Thank you!
I will take a close look at it and update to the trunk version too.
Do you want me to report PDFs that could not be displayed correctly in
the future?
Best regards,
Stefan
On 2015-01-15 09:03, Maruan Sahyoun wrote:
> Hi Stefan,
>
> yes, PDFBox is capable of doing this. To crop the page to the dimensions you need you can use
>
> PDPage.setCropBox [http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)]
> As John pointed out, the SuperimposePage example will give you the basics to import and 'mount' the page into a new or existing PDF.
>
> The only thing left is to get the coordinates from the mouse and translate them into the dimensions of the rectangle in PDF.
>
> BR
> Maruan
Re: Problems building the project
Posted by Tilman Hausherr <TH...@t-online.de>.
On 18.01.2015 at 20:33, Stefan Falk wrote:
> junit.framework.AssertionFailedError: JCE unlimited strength jurisdiction policy files are not installed
https://www.google.de/search?q=JCE+unlimited+strength+jurisdiction+policy+files
Re: Problems building the project
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
I can't say anything about the Eclipse problems (I use NetBeans), but
regarding strong encryption - could it be that you have several JDK
versions on your system?
Another possible cause - I'm not 100% sure of this - is whether the
policy files were installed in the JRE's lib/security or in the JDK's
jre/lib/security directory. Which one matters probably depends on which
runtime is actually used.
http://docs.oracle.com/cd/E19398-01/820-1228/agfik/index.html
"Where, <java-home> is the JRE directory within your Java Development
Kit (JDK) environment, or the top-level directory of the JRE. "
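One way to see which runtime is actually in use, and whether it already allows strong keys, is a small check like the sketch below. The class and method names are made up for illustration and are not part of PDFBox; the only library call is the standard javax.crypto.Cipher.getMaxAllowedKeyLength. It assumes that the default (limited) policy caps AES at 128 bits.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

// Hypothetical helper for illustration; not part of PDFBox.
class JcePolicyCheck {

    // Largest AES key size (in bits) that the running JRE's policy allows.
    static int maxAesKeyLength() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("AES is not available in this JRE", e);
        }
    }

    // Assumption: the limited policy caps AES at 128 bits, so anything
    // larger means the unlimited strength policy is active.
    static boolean hasUnlimitedStrength() {
        return maxAesKeyLength() > 128;
    }

    public static void main(String[] args) {
        // java.home is the <java-home> directory of the runtime executing this code.
        System.out.println("java.home           = " + System.getProperty("java.home"));
        System.out.println("max AES key length  = " + maxAesKeyLength());
        System.out.println("unlimited strength? = " + hasUnlimitedStrength());
    }
}
```

Running it once from the IDE and once with the Java installation that "mvn -version" reports as its Java home should show whether the two builds use different runtimes.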
Tilman
On 18.01.2015 at 22:19, Stefan Falk wrote:
> Hm, I don't get it. The requested files are already present on my
> system. And I can even run the JUnit tests in Eclipse for the
> encryption package successfully without any fails.
>
> It only fails if I run "mvn clean install" manually.
>
> I am rather concerned about these errors I am getting in Eclipse
>
> > Error(s) found in manifest configuration
> (org.apache.felix:maven-bundle-plugin:2.4.0:bundle:default-bundle:package)
> pom.xml
>
> and
>
> > Plugin execution not covered by lifecycle configuration:
> org.codehaus.mojo:javacc-maven-plugin:2.6:javacc (execution: javacc,
> phase: generate-sources)
> > Plugin execution not covered by lifecycle configuration:
> com.googlecode.maven-download-plugin:maven-download-plugin:1.1.0:wget
> (execution: get-isartor, phase: generate-test-resources)
>
> any ideas why I get these errors?
>
> Best regards,
> Stefan
Re: Problems building the project
Posted by John Hewson <jo...@jahewson.com>.
On 18 Jan 2015, at 13:19, Stefan Falk <s....@student.tugraz.at> wrote:
>
> Hm, I don't get it. The requested files are already present on my system. And I can even run the JUnit tests in Eclipse for the encryption package successfully without any fails.
Sounds like you have more than one JRE installed. Try running the org.apache.pdfbox.util.TestRendering test from Eclipse and see what it prints out; it should log your JDK and version, e.g.
JDK: Java(TM) SE Runtime Environment
Version: 1.8
The above version is what appears in your attached Maven log from the previous mail.
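If running the test is inconvenient, roughly the same information can be read from the standard system properties. This is only a sketch: the class name is hypothetical, and the output format merely imitates the two lines quoted above rather than reproducing what TestRendering logs.

```java
// Hypothetical helper for illustration; not part of PDFBox.
class RuntimeInfo {

    // Builds a short description of the runtime executing this code,
    // using standard system properties only.
    static String describe() {
        return "JDK: " + System.getProperty("java.runtime.name", "unknown") + "\n"
             + "Version: " + System.getProperty("java.version", "unknown") + "\n"
             + "Vendor: " + System.getProperty("java.vendor", "unknown");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Comparing the output in Eclipse with the output on the command line used for Maven makes a mismatch between the two runtimes easy to spot.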
— John
> It only fails if I run "mvn clean install" manually.
>
> I am rather concerned about these errors I am getting in Eclipse
>
> > Error(s) found in manifest configuration (org.apache.felix:maven-bundle-plugin:2.4.0:bundle:default-bundle:package) pom.xml
I’m not sure where this error is coming from. If it’s M2E, then it’s not our problem. If it’s Maven, then we might want to look at updating the Felix plugin. However, that plugin is for creating OSGi bundles, so I doubt it’s the cause of your other build problem.
> and
>
> > Plugin execution not covered by lifecycle configuration: org.codehaus.mojo:javacc-maven-plugin:2.6:javacc (execution: javacc, phase: generate-sources)
> > Plugin execution not covered by lifecycle configuration: com.googlecode.maven-download-plugin:maven-download-plugin:1.1.0:wget (execution: get-isartor, phase: generate-test-resources)
>
> any ideas why I get these errors?
These errors are specific to M2E and its configuration and not related to the Maven build itself.
> Best regards,
> Stefan
Re: Problems building the project
Posted by Stefan Falk <s....@student.tugraz.at>.
Hm, I don't get it. The requested files are already present on my
system, and I can even run the JUnit tests for the encryption package in
Eclipse without any failures.
It only fails if I run "mvn clean install" manually.
I am rather concerned about these errors I am getting in Eclipse
> Error(s) found in manifest configuration
(org.apache.felix:maven-bundle-plugin:2.4.0:bundle:default-bundle:package)
pom.xml
and
> Plugin execution not covered by lifecycle configuration:
org.codehaus.mojo:javacc-maven-plugin:2.6:javacc (execution: javacc,
phase: generate-sources)
> Plugin execution not covered by lifecycle configuration:
com.googlecode.maven-download-plugin:maven-download-plugin:1.1.0:wget
(execution: get-isartor, phase: generate-test-resources)
Any ideas why I get these errors?
Best regards,
Stefan
On 2015-01-18 21:08, Maruan Sahyoun wrote:
> Hello Stefan,
>
> please find the dependencies listed here https://pdfbox.apache.org/2.0/dependencies.html. You're missing the "unlimited strength" cryptography
>
> BR
> Maruan
Re: Problems building the project
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hello Stefan,
Please find the dependencies listed at https://pdfbox.apache.org/2.0/dependencies.html. You're missing the "unlimited strength" cryptography policy files.
BR
Maruan
On 18.01.2015 at 20:33, Stefan Falk <s....@student.tugraz.at> wrote:
> Hi!
>
> I really get quite a list of errors when I check out the trunk as Maven project using Eclipse Luna and the latest m2e Maven plugin (see screenshot).
>
> I am not sure if I am missing a plugin or if I am using somewhere the wrong version of a plugin.
>
> I've tried to do it manually by calling "mvn clean install" but this fails too (see maven.log).
>
> Any help would be appreciated! Thank you!
>
> Best regards,
> Stefan
> <maven.log>
Problems building the project
Posted by Stefan Falk <s....@student.tugraz.at>.
Hi!
I get quite a list of errors when I check out the trunk as a Maven
project using Eclipse Luna and the latest m2e Maven plugin (see
screenshot).
I am not sure whether I am missing a plugin or using the wrong version
of a plugin somewhere.
I've tried to do it manually by calling "mvn clean install" but this
fails too (see maven.log).
Any help would be appreciated! Thank you!
Best regards,
Stefan
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Stefan,
Yes, PDFBox is capable of doing this. To crop the page to the dimensions you need, you can use
PDPage.setCropBox [http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)]
As John pointed out, the SuperimposePage example will give you the basics to import and 'mount' the page into a new or existing PDF.
The only thing left is to get the coordinates from the mouse and translate them into the dimensions of the rectangle in PDF.
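The coordinate translation could look roughly like the sketch below. It is plain Java with hypothetical names (nothing here is PDFBox API) and it rests on a few assumptions: the page is shown unrotated at a uniform scale, the viewer's origin is the top-left corner with y growing downwards, and the page's visible box starts at (0, 0) in PDF space, where y grows upwards.

```java
// Hypothetical helper for illustration; not part of PDFBox.
class SelectionToPdf {

    // Simple value holder: lower-left corner plus size, in PDF points.
    static final class PdfRect {
        final float x, y, width, height;

        PdfRect(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        @Override
        public String toString() {
            return "PdfRect[x=" + x + ", y=" + y + ", width=" + width + ", height=" + height + "]";
        }
    }

    /**
     * Converts a mouse drag from (x1, y1) to (x2, y2), given in screen pixels,
     * into a rectangle in PDF points.
     *
     * @param scale        pixels per PDF point used for rendering (e.g. 150 dpi / 72)
     * @param pageHeightPt height of the displayed page in points
     */
    static PdfRect toPdfRect(int x1, int y1, int x2, int y2, float scale, float pageHeightPt) {
        // The drag may go in any direction, so normalise the corners first.
        float left = Math.min(x1, x2) / scale;
        float right = Math.max(x1, x2) / scale;
        float top = Math.min(y1, y2) / scale;     // distance from the top edge
        float bottom = Math.max(y1, y2) / scale;  // distance from the top edge

        // Flip the y axis: PDF measures from the bottom edge upwards.
        return new PdfRect(left, pageHeightPt - bottom, right - left, bottom - top);
    }

    public static void main(String[] args) {
        // A drag on a US Letter page (792 pt high) rendered at 2 pixels per point.
        System.out.println(toPdfRect(100, 200, 300, 100, 2.0f, 792f));
    }
}
```

The four resulting numbers are what a PDRectangle for setCropBox would be built from. If the page's box does not start at (0, 0), its lower-left offset has to be added, and a rotated page needs an extra transformation.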
BR
Maruan
On 15.01.2015 at 08:48, Stefan Falk <s....@student.tugraz.at> wrote:
> Hi John!
>
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox be the best choice for this? I have looked for a library for a long time, but there do not seem to be many open source tools out there.
>
> My goal is a program that lets the user clip PDFs and compose a new PDF out of all the clips; maybe you could tell me whether pdfbox would be the best choice for such a task.
>
> @fairly difficult: Well yes, I was quite astonished to find out that extracting content from a PDF is actually a scientific topic :D
>
> Best regards,
> Stefan
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by Stefan Falk <s....@student.tugraz.at>.
Hi John!
Yes, clipping the PDF is basically what I would like to do! So would
pdfbox be the best choice for this? I have looked for a library for a
long time, but there do not seem to be many open source tools out there.
My goal is a program that lets the user clip PDFs and compose a new PDF
out of all the clips; maybe you could tell me whether pdfbox would be
the best choice for such a task.
@fairly difficult: Well yes, I was quite astonished to find out that
extracting content from a PDF is actually a scientific topic :D
Best regards,
Stefan
On 2015-01-15 03:21, John Hewson wrote:
> Hi Stefan
>
> What you’re describing is actually fairly difficult due to the complexity of the PDF operators. We have a special processor for text in PDFBox, but it is not necessarily accurate.
>
> If you’re just trying to embed pages from existing PDFs into new PDFs, then the SuperimposePage example which comes with PDFBox might already serve your needs. If you specify a custom BBox for the FormXObject, then you can use that to clip the page - which sounds like what you want. Please note that this technique still embeds all of the original page contents, so it’s not suitable for removing private or sensitive data, but otherwise it’s fine.
>
> If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk version of PDFBox, where we have fixed many bugs.
>
> Thanks
>
> -- John
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by John Hewson <jo...@jahewson.com>.
Hi Stefan
What you’re describing is actually fairly difficult due to the complexity of the PDF operators. We have a special processor for text in PDFBox, but it is not necessarily accurate.
If you’re just trying to embed pages from existing PDFs into new PDFs, then the SuperimposePage example which comes with PDFBox might already serve your needs. If you specify a custom BBox for the FormXObject, then you can use that to clip the page - which sounds like what you want. Please note that this technique still embeds all of the original page contents, so it’s not suitable for removing private or sensitive data, but otherwise it’s fine.
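When several such clips are composed on one page, each form needs a transformation that moves its visible region to the desired spot. The arithmetic can be sketched in plain Java as below. The class and method names are hypothetical and no PDFBox calls are involved; it assumes the BBox equals the clip rectangle in the source page's coordinates and that only uniform scaling and translation are wanted.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox.
class ClipPlacement {

    /**
     * Returns the six numbers "a b c d e f" of a PDF transformation matrix
     * (the operands of the cm operator) that scales uniformly by scale and
     * maps the clip's lower-left corner (clipX, clipY) to (targetX, targetY).
     */
    static float[] placementMatrix(float clipX, float clipY,
                                   float targetX, float targetY, float scale) {
        // A point (x, y) is mapped to (scale * x + e, scale * y + f),
        // so solve for e and f at the clip's lower-left corner.
        float e = targetX - scale * clipX;
        float f = targetY - scale * clipY;
        return new float[] { scale, 0f, 0f, scale, e, f };
    }

    public static void main(String[] args) {
        // Clip starting at (100, 500) on the source page, shown at half size
        // with its lower-left corner at (50, 50) on the new page.
        System.out.println(Arrays.toString(placementMatrix(100f, 500f, 50f, 50f, 0.5f)));
    }
}
```

The translation has to compensate for the clip's own offset because the BBox only hides what lies outside it; it does not move the remaining content to the origin.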
If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk version of PDFBox, where we have fixed many bugs.
Thanks
-- John
> On 14 Jan 2015, at 15:14, Stefan Falk <s....@student.tugraz.at> wrote:
>
> Well, basically just extract it to load it into another PDF but it should be possible e.g. with the mouse.
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by Stefan Falk <s....@student.tugraz.at>.
Well, basically I just want to extract it and load it into another PDF,
but it should be possible to select it with, e.g., the mouse.
On 2015-01-14 22:52, Maruan Sahyoun wrote:
> what would you like to do with that content?
>
> BR
> Maruan
Re: Extract underlying PDF code from PDF file by selecting an area
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
What would you like to do with that content?
BR
Maruan
On 14.01.2015 at 21:42, Stefan Falk <s....@student.tugraz.at> wrote:
> Hello pdfbox people!
>
> I was wondering if anybody can help me. What I am looking for is a way to extract the underlying PDF code from a PDF file by simply selecting an area with the mouse.
>
> After reading a bit about PDFs, I have learned that extracting anything from a PDF can be quite a hard task.
>
> So I was wondering if pdfbox could do that somehow. I've taken a rough look at the PDFReader and noticed that there is, e.g., processTextPosition in the class PageDrawer, which seems to let me get at least the position of text - am I right in assuming that?
>
> My concrete question is: what is possible with pdfbox in this regard? E.g. I have a PDF on my drive whose text seems to be "extractable" by pdfbox, yet the PDFReader is not able to render any of it. It just renders the images (see attachment).
>
> Thank you for your help in advance!
>
> Best regards,
> Stefan