Posted to dev@pdfbox.apache.org by "Igor Spasic (JIRA)" <ji...@apache.org> on 2010/11/09 15:30:07 UTC
[jira] Created: (PDFBOX-890) Can't extract text from PDF
Can't extract text from PDF
---------------------------
Key: PDFBOX-890
URL: https://issues.apache.org/jira/browse/PDFBOX-890
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.3.1
Reporter: Igor Spasic
I created a simple PDF using the Bullzip PDF printer (a virtual Windows printer).
PDFBox is not able to extract the text from this PDF; it just returns some low-ASCII characters.
command:
@java -jar pdfbox-app-1.3.1.jar ExtractText -console test.pdf
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PDFBOX-890) Can't extract text from PDF
Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933883#action_12933883 ]
Martijn Brinkers commented on PDFBOX-890:
-----------------------------------------
The singleByteMappings contain all the expected characters ('E', 'x', 't', ...), but the singleByteMappings are not used. I have attached a patch that fixes this. The PDF gurus should check whether my patch is a correct general fix or whether it only fixes this particular bug.
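To illustrate the kind of failure described in this comment, here is a minimal, self-contained sketch. It is not PDFBox's actual code and not the attached patch: the class ToyCMap, its methods, and the useSingleByte flag are hypothetical names invented for this example, and the fallback of emitting the raw byte value is an assumption chosen to mimic the "low ASCII" output from the report. It only shows how text can come out as control characters when a table of one-byte mappings is populated but never consulted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy character-code map; NOT the PDFBox CMap class. */
public class ToyCMap {
    // One-byte codes -> text, e.g. 0x01 -> "E"
    private final Map<Integer, String> singleByteMappings = new HashMap<>();
    // Two-byte codes -> text
    private final Map<Integer, String> doubleByteMappings = new HashMap<>();

    /** Stores a mapping in the table that matches the code length. */
    public void addMapping(byte[] code, String text) {
        if (code.length == 1) {
            singleByteMappings.put(code[0] & 0xFF, text);
        } else if (code.length == 2) {
            doubleByteMappings.put(((code[0] & 0xFF) << 8) | (code[1] & 0xFF), text);
        } else {
            throw new IllegalArgumentException("only 1- or 2-byte codes supported");
        }
    }

    /**
     * Decodes a byte string. When useSingleByte is false, the one-byte table
     * is ignored, which reproduces the symptom: unmapped codes fall through
     * and are emitted as raw (low-ASCII) characters.
     */
    public String decode(byte[] data, boolean useSingleByte) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            String hit = useSingleByte ? singleByteMappings.get(data[i] & 0xFF) : null;
            if (hit != null) {
                out.append(hit);
                i++;
                continue;
            }
            if (i + 1 < data.length) {
                int code = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
                hit = doubleByteMappings.get(code);
                if (hit != null) {
                    out.append(hit);
                    i += 2;
                    continue;
                }
            }
            // Assumed fallback: pass the raw byte value through as a character.
            out.append((char) (data[i] & 0xFF));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyCMap cmap = new ToyCMap();
        cmap.addMapping(new byte[] {0x01}, "E");
        cmap.addMapping(new byte[] {0x02}, "x");
        cmap.addMapping(new byte[] {0x03}, "t");
        byte[] data = {0x01, 0x02, 0x03};

        System.out.println(cmap.decode(data, true)); // prints "Ext"

        // With the one-byte table ignored, the result is three control characters.
        for (char c : cmap.decode(data, false).toCharArray()) {
            System.out.println((int) c);
        }
    }
}
```

Whether consulting the one-byte table first is the right general behaviour is exactly the open question raised in the comment; the sketch only demonstrates the symptom, not the correct fix.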
[jira] Updated: (PDFBOX-890) Can't extract text from PDF
Posted by "Igor Spasic (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Igor Spasic updated PDFBOX-890:
-------------------------------
Attachment: test.pdf
[jira] Commented: (PDFBOX-890) Can't extract text from PDF
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980593#action_12980593 ]
Andreas Lehmkühler commented on PDFBOX-890:
-------------------------------------------
The text extraction works fine in the current trunk (rev. 1057780). The rendering is still mixed up.
[jira] Commented: (PDFBOX-890) Can't extract text from PDF
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972397#action_12972397 ]
Andreas Lehmkühler commented on PDFBOX-890:
-------------------------------------------
Looks good. But we have to run some more tests, as I'm not sure whether your patch is the solution or just a workaround for the given PDF with possible side effects.
[jira] Updated: (PDFBOX-890) Can't extract text from PDF
Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martijn Brinkers updated PDFBOX-890:
------------------------------------
Attachment: PDFBOX-890.patch