Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/11/06 04:20:00 UTC
[jira] [Comment Edited] (PDFBOX-4367) Error expected floating point number actual='18-5'

    [ https://issues.apache.org/jira/browse/PDFBOX-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676114#comment-16676114 ] 

Tilman Hausherr edited comment on PDFBOX-4367 at 11/6/18 4:19 AM:
------------------------------------------------------------------

The "-force" option is a leftover in the documentation; there is a new issue about it, PDFBOX-4369.

There is no option to process each page by itself. Apache Tika (which uses PDFBox) has one; I know because we discussed it about a year ago. I can't find it on [https://tika.apache.org/1.19.1/gettingstarted.html], so it may be available only programmatically and not on the command line.

You could also modify ExtractText yourself: first get the number of pages with `document.getNumberOfPages()`, then for each page call `setStartPage()` and `setEndPage()` with that page number and run `writeText()`, once per page. Note that page numbers are 1-based here (they are 0-based in some other places).
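
A minimal sketch of that per-page loop, so that one corrupt page does not abort the whole extraction. To keep it runnable without PDFBox on the classpath, the page extraction sits behind a hypothetical `PageTextSource` interface; the class and method names are illustrative assumptions, not part of PDFBox. The comment shows how it could be wired to the `PDFTextStripper` calls mentioned above (PDFBox 2.0.x).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class PerPageExtraction {

    /** Hypothetical abstraction: returns the text of one page (1-based). */
    interface PageTextSource {
        String textOf(int pageNumber) throws IOException;
    }

    // With PDFBox 2.0.x, an adapter could look roughly like this (untested sketch):
    //   PDFTextStripper stripper = new PDFTextStripper();
    //   PageTextSource source = page -> {
    //       stripper.setStartPage(page);   // 1-based
    //       stripper.setEndPage(page);
    //       StringWriter out = new StringWriter();
    //       stripper.writeText(document, out);
    //       return out.toString();
    //   };
    //   extractEachPage(document.getNumberOfPages(), source);

    /**
     * Extracts pages 1..pageCount one at a time. A page that throws an
     * IOException is recorded as null instead of aborting the run.
     */
    static List<String> extractEachPage(int pageCount, PageTextSource source) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                pages.add(source.textOf(page));
            } catch (IOException e) {
                System.err.println("Skipping page " + page + ": " + e.getMessage());
                pages.add(null);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in source: page 2 fails like the corrupt content stream in the report.
        PageTextSource fake = page -> {
            if (page == 2) {
                throw new IOException("Error expected floating point number actual='18-5'");
            }
            return "text of page " + page;
        };
        // prints: [text of page 1, null, text of page 3]
        System.out.println(extractEachPage(3, fake));
    }
}
```

Whether skipping a page is acceptable depends on the use case; recording null (rather than dropping the entry) keeps the list index aligned with the page number.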

I could also implement this in ExtractText; it would make sense for people who need it and can't change the code. The main problem is that I'd need a good name for the option (not "force").


> Error expected floating point number actual='18-5'
> --------------------------------------------------
>
>                 Key: PDFBOX-4367
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4367
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.12
>         Environment: Mac OS X Sierra
>            Reporter: Peter Johnson
>            Priority: Minor
>
> I can reproduce this from the command line. Unfortunately, the only files that reproduce it are from a customer and contain sensitive information. The file opens without error in Acrobat Reader and Mac Preview. The desired result is that any corrupt portions of the PDF are skipped, so that we can use whatever text is extractable.
> Unfortunately, I still get an error when using the -force option.
> We get the following stack trace:
> {code:java}
> C02V390UHTD6:Downloads pjohnson$ java -jar pdfbox-app-2.0.12.jar ExtractText 16cccd9af5032a303774f7b87fb95076.pdf
> Nov 02, 2018 10:04:54 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 19727
> Exception in thread "main" java.io.IOException: Error expected floating point number actual='18-5'
> at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
> at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:110)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:947)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:631)
> at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:174)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
> at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
> at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> Caused by: java.lang.NumberFormatException
> at java.math.BigDecimal.<init>(BigDecimal.java:494)
> at java.math.BigDecimal.<init>(BigDecimal.java:383)
> at java.math.BigDecimal.<init>(BigDecimal.java:806)
> at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
> ... 14 more
> {code}


