Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/04/06 04:51:00 UTC

[jira] [Commented] (PDFBOX-5152) Content Stream Appears Truncated in Specific File

    [ https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315232#comment-17315232 ] 

Tilman Hausherr commented on PDFBOX-5152:
-----------------------------------------

Your page content stream really is just "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q "; I looked at it with PDFDebugger. And yes, there is more: "/Fm0 Do" draws the form XObject Fm0, which has its own, quite long, content stream. You should not put a content stream into a String because it's binary data. As for those "bounds", that is outside of PDFBox. Could you create a minimal stand-alone tool that shows these tokens and their offsets, to see where the 313 comes from? I'm willing to help somewhat, but I don't want to debug your software.
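
To illustrate the point about binary data, here is a minimal sketch using only the JDK (no PDFBox). The class name StreamBytesDemo, the helper survivesRoundTrip, and the sample bytes are made up for this illustration and are not taken from January.pdf. It shows that bytes which are not valid UTF-8 do not survive a decode/encode round trip, so character offsets in such a String need not match byte offsets in the stream.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StreamBytesDemo {

    /** Returns true if the bytes are unchanged after decoding to a String and encoding back. */
    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        byte[] encoded = decoded.getBytes(charset);
        return Arrays.equals(data, encoded);
    }

    public static void main(String[] args) {
        // The page content stream from the report: pure ASCII, so it happens to be safe.
        byte[] ascii = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q "
                .getBytes(StandardCharsets.US_ASCII);

        // Hypothetical operand bytes: a PDF string containing 0xE9 0xFF, which is not valid UTF-8.
        byte[] binary = {'(', (byte) 0xE9, (byte) 0xFF, ')', ' ', 'T', 'j'};

        System.out.println("ASCII via UTF-8:       " + survivesRoundTrip(ascii, StandardCharsets.UTF_8));
        System.out.println("binary via UTF-8:      " + survivesRoundTrip(binary, StandardCharsets.UTF_8));
        // ISO-8859-1 maps every byte to exactly one char, so it is lossless for a debugging view.
        System.out.println("binary via ISO-8859-1: " + survivesRoundTrip(binary, StandardCharsets.ISO_8859_1));
    }
}
```

If a textual view is needed for debugging only, decoding with ISO-8859-1 keeps one char per byte; for actual processing it is safer to stay with byte[] or the parser's tokens.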

> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
>                 Key: PDFBOX-5152
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5152
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.23
>            Reporter: Steven Fontaine
>            Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert the colors of a PDF file. An [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised with a [PDF file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf] that, when parsed by PDFBox, appears to give a truncated content stream. That is, running the following code results in a substantially shorter content stream than I would expect:
> {code:java}
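> import java.io.File;
> import java.nio.charset.StandardCharsets;
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
>
> try (PDDocument doc = PDDocument.load(new File("January.pdf"))) {
>   for (PDPage page : doc.getPages()) {
>     String stream = new String(IOUtils.toByteArray(page.getContents()), StandardCharsets.UTF_8);
>     System.out.println(stream);
>   }
> }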
> {code}
>  The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q 
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content streams, but I can fairly confidently say that more than this is required to draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox is making reference to additional data that isn't contained in the content stream returned from {{page.getContents()}}.
> In my program, I need to find specific substrings in the content stream to locate specific operations and their arguments. To do so, I [wrap {{PDFStreamParser.parseNextToken()}} with queries to {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52]. This gives me the bounds of each token in the content stream without having to parse it myself (letting {{parseNextToken}} do the work for me). When I look at the bounds these queries give me, they extend further than the length of the content stream returned by {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other words, the token parsed by {{parseNextToken}} corresponds to characters 19-313 (inclusive, 0-based index) of the content stream. But the content stream returned by {{page.getContents()}} doesn't contain 313 characters.
> Hopefully someone can shed some light on this issue for me. Thanks!
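
The kind of token/offset listing discussed above can be sketched without PDFBox. This is only an illustration under simplifying assumptions: it splits on PDF whitespace bytes and ignores delimiters, strings, comments and inline images, so it is not a substitute for PDFStreamParser. The names TokenOffsetsDemo, Span and split are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TokenOffsetsDemo {

    /** A token's text plus its inclusive, 0-based byte bounds in the stream. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** PDF whitespace bytes: NUL, TAB, LF, FF, CR and SPACE. */
    static boolean isWhitespace(byte b) {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /** Splits the stream on whitespace and records each token's byte bounds. */
    static List<Span> split(byte[] stream) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < stream.length) {
            while (i < stream.length && isWhitespace(stream[i])) {
                i++; // skip separators
            }
            int start = i;
            while (i < stream.length && !isWhitespace(stream[i])) {
                i++; // consume one token
            }
            if (i > start) {
                // ISO-8859-1 keeps one char per byte, so the text cannot shift the offsets.
                String text = new String(stream, start, i - start, StandardCharsets.ISO_8859_1);
                spans.add(new Span(text, start, i - 1));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        byte[] stream = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q "
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("stream length: " + stream.length);
        for (Span s : split(stream)) {
            String flag = s.end >= stream.length ? "  <-- beyond stream length" : "";
            System.out.println(s.text + " [" + s.start + ", " + s.end + "]" + flag);
        }
    }
}
```

Running the same kind of listing against the bytes the parser actually reads, and comparing each end offset with that byte array's length, would show whether a bound such as 313 refers to this stream or to a different one.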


