Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/01/24 19:18:00 UTC

[jira] [Comment Edited] (PDFBOX-4750) java.io.IOException: Error:Unknown type in content stream:COSNull{}

    [ https://issues.apache.org/jira/browse/PDFBOX-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022695#comment-17022695 ] 

Tilman Hausherr edited comment on PDFBOX-4750 at 1/24/20 7:17 PM:
------------------------------------------------------------------

Uhm, if you can't share the file, then I hope the actual content here isn't confidential. The text starts with "in der von uns unter", so check whether the page text is harmless.

The /DP [   ] segment is just three spaces. Could it be that the "null" thing is on another page? Or check with the hex viewer in PDFDebugger. Or was the "null" inserted by the software that modified the content stream? Or was something lost in copy & paste?

This code works fine:
{code}
PDFStreamParser parser = new PDFStreamParser(new FileInputStream("contentAllOperatorsOfCorruptedPage.txt"));
parser.parse();
List<Object> tokens = parser.getTokens();
ContentStreamWriter tokenWriter = new ContentStreamWriter(new ByteArrayOutputStream());
tokenWriter.writeTokens(tokens);
{code}



was (Author: tilman):
Uhm, if you can't share the file, then I hope the actual content here isn't confidential. The text starts with "in der von uns unter", so check whether the page text is harmless.

The /DP [   ] segment is just three spaces. Could it be that the "null" thing is on another page? Or check with the hex viewer in PDFDebugger. Or was the "null" inserted by the software that modified the content stream? Or was something lost in copy & paste?

This code works fine:
{code}
PDFStreamParser parser = new PDFStreamParser(new FileInputStream("contentAllOperatorsOfCorruptedPage.txt"));
List<Object> tokens = parser.getTokens();
ContentStreamWriter tokenWriter = new ContentStreamWriter(new ByteArrayOutputStream());
tokenWriter.writeTokens(tokens);
{code}


> java.io.IOException: Error:Unknown type in content stream:COSNull{}
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-4750
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4750
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>    Affects Versions: 2.0.8, 2.0.18
>            Reporter: tomas kochan
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.19, 3.0.0 PDFBox
>
>         Attachments: 01 - K17 - Was dahinter steckt - dsb.pdf, contentAllOperatorsOfCorruptedPage.txt
>
>
> When removing optional content from a specific document (the content is delimited by the BDC and EMC operators), we hit an issue when writing the modified list of tokens into a PDStream.
>  The code looks like this:
>  PDStream updatedStream = new PDStream(document);
>  OutputStream out = updatedStream.getCOSObject().createRawOutputStream();
>  ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
>  tokenWriter.writeTokens(result);
>  out.flush();
>  out.close();
>  page.setContents(updatedStream);
>   
>  The following exception occurs at the line 'tokenWriter.writeTokens(result);':
>  java.io.IOException: Error:Unknown type in content stream:COSNull{}
>  at org.apache.pdfbox.pdfwriter.ContentStreamWriter.writeObject(ContentStreamWriter.java:199)
>  at org.apache.pdfbox.pdfwriter.ContentStreamWriter.writeObject(ContentStreamWriter.java:146)
>  at org.apache.pdfbox.pdfwriter.ContentStreamWriter.writeObject(ContentStreamWriter.java:181)
>  at org.apache.pdfbox.pdfwriter.ContentStreamWriter.writeTokens(ContentStreamWriter.java:109)
>  at de.justiz.eip.pdf.tools.PdfContext.getOrRemoveOptionalTextContentfromPage(PdfContext.java:429)
>  at de.justiz.eip.pdf.tools.paging.PagingInfoInterpreterPdfContext.removePagingInfo(PagingInfoInterpreterPdfContext.java:325)
>  
> After analyzing the problem we found two issues:
>  1. We assume that the PDF document itself is corrupted. At one place it contains the operator BI, which according to the PDF Reference 1.7 begins an inline image object. This operator is not followed by an "ID" or "EI" operator.
>  Extract from the list of tokens:
>  next PDFOperator{Do}
>  next COSFloat{0.016674607}
>  next COSInt{0}
>  next COSInt{0}
>  next COSFloat{0.061831153}
>  next COSFloat{0.070509767}
>  next COSFloat{-0.302021403}
>  next PDFOperator{cm}
>  next PDFOperator{BI}
>  next PDFOperator{Q}
>  next PDFOperator{Q}
>  next COSName{OC}
>  next COSName{eAkteOptionalContent7}
>  next PDFOperator{BDC}
> Moreover, one "DP" entry in the "BI" operator's COSDictionary contains a COSArray with COSNull values. However, we assume that COSNull values are not forbidden in PDF content.
>  COSDictionary{COSName{Interpolate}:true;COSName{W}:COSInt{35};COSName{H}:COSInt{26};COSName{CS}:COSName{RGB};COSName{BPC}:COSInt{8};COSName{F}:COSArray{[COSName{A85},COSName{DCT}]};COSName{DP}:COSArray{[COSNull{},COSNull{}]};}
> 2. Regardless of the invalid content in the PDF document (described above), the PDFBox API fails when storing these operators into a PDStream, because the method org.apache.pdfbox.pdfwriter.ContentStreamWriter.writeObject(Object) does not recognize COSNull.
>  
> We assume that the method "writeObject" simply does not cover COSNull as a valid input. org.apache.pdfbox.cos.COSNull.NULL is a valid object that is widely used by the PDFBox API itself.
> In PDFBox 2.0.8 and also in 2.0.18, the method org.apache.pdfbox.pdfwriter.ContentStreamWriter.writeObject(Object) does not cover the COSNull case in its if/else conditions; instead it throws new IOException( "Error:Unknown type in content stream:" + o ).
>  
> Could you confirm that the method writeObject contains a bug and should be corrected to also cover COSNull objects? If so, in which version can we expect the fix?
> Thank you
>  
>  
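The gap described in point 2 of the issue can be illustrated with a minimal, self-contained sketch. It does not use PDFBox, and it is neither the actual ContentStreamWriter code nor the fix that was applied. The class NullTolerantTokenWriter, the NULL sentinel, and the Name and Operator types are hypothetical stand-ins for COSNull.NULL, COSName and the operator class. The sketch only shows the idea of giving the if/else dispatch an explicit branch that writes the PDF keyword null, so a null array element no longer falls through to the "unknown type" exception.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class NullTolerantTokenWriter {

    /** Hypothetical stand-in for the COSNull.NULL singleton. */
    public static final Object NULL = new Object();

    /** Hypothetical stand-in for a PDF name such as /DP. */
    public static final class Name {
        final String value;
        public Name(String value) { this.value = value; }
    }

    /** Hypothetical stand-in for a content stream operator such as BI or Q. */
    public static final class Operator {
        final String value;
        public Operator(String value) { this.value = value; }
    }

    /** Serializes a token list into content stream syntax. */
    public static String write(List<?> tokens) {
        StringBuilder out = new StringBuilder();
        for (Object token : tokens) {
            writeObject(token, out);
        }
        return out.toString();
    }

    private static void writeObject(Object o, StringBuilder out) {
        if (o == NULL) {
            // The branch the issue reports as missing: emit the PDF keyword "null".
            out.append("null ");
        } else if (o instanceof Name) {
            out.append('/').append(((Name) o).value).append(' ');
        } else if (o instanceof Integer) {
            out.append(o).append(' ');
        } else if (o instanceof List) {
            // Arrays recurse, so null elements inside an array are handled too.
            out.append('[');
            for (Object element : (List<?>) o) {
                writeObject(element, out);
            }
            out.append("] ");
        } else if (o instanceof Map) {
            out.append("<<");
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) o).entrySet()) {
                writeObject(entry.getKey(), out);
                writeObject(entry.getValue(), out);
            }
            out.append(">> ");
        } else if (o instanceof Operator) {
            out.append(((Operator) o).value).append('\n');
        } else {
            // Assumption: a plain unchecked exception replaces the IOException of the real writer.
            throw new IllegalArgumentException("Unknown type in content stream: " + o);
        }
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.<Object>asList(
                new Name("DP"), Arrays.asList(NULL, NULL), new Operator("Q"));
        System.out.print(write(tokens));
    }
}
```

Without the first branch, the array from the "DP" entry would reach the final else and raise the exception, which matches the failure path reported in the stack trace above.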


