
Posted to users@pdfbox.apache.org by Esteban R <er...@hotmail.com> on 2018/02/05 14:43:35 UTC

Stream parsing issue in multi-stream page

Hello. I need to rewrite a PDPage that has many content streams, one stream at a time (I apply some transformations, and there is a special requirement to handle each stream separately). Parsing (and pdfdebug) returns "wrong" tokens if a command begins at the end of one stream and ends at the beginning of the next one. I'm using pdfbox-2.0.8.

Rewriting the stream with those tokens produces a corrupted page.
How can we rewrite the page without corrupting it?
Or, at least, how can we detect this kind of failure (or this one in particular)?

Please find a simplified example here:
http://www.filedropper.com/out3unc

The first stream is:
/F1 10 Tf
BT
40 764.138 Td
0 -12.138 Td
[

and the second one is:
(CD) ] TJ
ET

In this case, running the following code:
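        // Assumes pdPage is an org.apache.pdfbox.pdmodel.PDPage (PDFBox 2.0.x).
        // Needs: java.util.Iterator, java.util.List,
        //        org.apache.pdfbox.pdfparser.PDFStreamParser,
        //        org.apache.pdfbox.pdmodel.common.PDStream
        Iterator<PDStream> itStreams = pdPage.getContentStreams();
        while (itStreams.hasNext()) {
            PDStream pdStream = itStreams.next();
            // Each content stream is parsed on its own here.
            PDFStreamParser parser = new PDFStreamParser(pdStream.toByteArray());
            parser.parse();
            List<Object> tokens = parser.getTokens();
            for (Object token : tokens) {
                System.out.println("Token: " + token);
            }
        }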

shows:
Token: COSName{F1}
Token: COSInt{10}
Token: PDFOperator{Tf}
Token: PDFOperator{BT}
Token: COSInt{40}
Token: COSFloat{764.138}
Token: PDFOperator{Td}
Token: COSInt{0}
Token: COSFloat{-12.138}
Token: PDFOperator{Td}
Token: COSArray{[]}                    !!!!! empty array detected, end of first stream
Token: COSString{CD}                   !!!!! beginning of second stream
Token: COSNull{}                       !!!!! closing "]"
Token: PDFOperator{TJ}
Token: PDFOperator{ET}


Esteban

Re: Stream parsing issue in multi-stream page

Posted by Malcolm Vincent <ma...@gmail.com>.

I had to do something similar recently myself. Don't. It doesn't work. You
have to read the page as one stream.

Cheers!
Malcolm.



RE: Stream parsing issue in multi-stream page

Posted by Esteban R <er...@hotmail.com>.

I need to analyze how the content is distributed across the different streams (I cannot provide additional details due to a confidentiality agreement). Then I may need to change some content in the streams and rewrite them. I also wanted to preserve the original structure of (many) streams, but that is not a hard requirement.

Esteban
________________________________
From: Maruan Sahyoun <sa...@fileaffairs.de>
Sent: Monday, February 5, 2018 04:19 PM
To: Esteban R
Subject: Re: Stream parsing issue in multi-stream page

Hi,
> On 05.02.2018 at 17:14, Esteban R <er...@hotmail.com> wrote:
>
> Thanks for your answer. But I really need to process the streams one by one (a special requirement in my project).

Could you explain why this is the case? It is possible for tokens to span streams (here, an array opened in one stream and closed in the next), so if you process the streams one by one the parser does not know about the continuation. From that perspective, the result you posted initially is expected.
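
One possible way to spot such a continuation before parsing is to check whether the raw bytes of a single stream leave an array, dictionary, or string open. The following is only a minimal sketch using the standard library: the class OpenConstructCheck and its method isUnbalanced are hypothetical names and not part of PDFBox, and the scan is an assumption-laden simplification (for example, it does not skip the binary data of inline images, which could contain stray brackets).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox. */
final class OpenConstructCheck {

    /** Returns true if the bytes open or close an array, dictionary or string
     *  that is not matched within the same data. */
    static boolean isUnbalanced(byte[] data) {
        int depth = 0; // currently open arrays and dictionaries
        int i = 0;
        while (i < data.length) {
            char c = (char) (data[i] & 0xFF);
            if (c == '%') {
                // Comment: skip to the end of the line.
                while (i < data.length && data[i] != '\n' && data[i] != '\r') {
                    i++;
                }
            } else if (c == '(') {
                // Literal string: parentheses may nest, backslash escapes one character.
                int parens = 1;
                i++;
                while (i < data.length && parens > 0) {
                    char s = (char) (data[i] & 0xFF);
                    if (s == '\\') {
                        i++;
                    } else if (s == '(') {
                        parens++;
                    } else if (s == ')') {
                        parens--;
                    }
                    i++;
                }
                if (parens > 0) {
                    return true; // string never closed
                }
            } else if (c == '<') {
                if (i + 1 < data.length && data[i + 1] == '<') {
                    depth++;     // dictionary start
                    i += 2;
                } else {
                    // Hex string: skip to the closing '>'.
                    while (i < data.length && data[i] != '>') {
                        i++;
                    }
                    if (i >= data.length) {
                        return true; // hex string never closed
                    }
                    i++;
                }
            } else if (c == '>' && i + 1 < data.length && data[i + 1] == '>') {
                depth--;         // dictionary end
                i += 2;
            } else if (c == '[') {
                depth++;
                i++;
            } else if (c == ']') {
                depth--;
                i++;
            } else {
                i++;
            }
        }
        return depth != 0;
    }

    public static void main(String[] args) {
        byte[] first = "0 -12.138 Td\n[\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isUnbalanced(first)); // prints true
    }
}
```

With the two example streams from this thread, both the first stream (ending in "[") and the second one (starting with "(CD) ]") would be flagged, while their concatenation would not.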

BR
Maruan


RE: Stream parsing issue in multi-stream page

Posted by Esteban R <er...@hotmail.com>.

Thanks for your answer. But I really need to process the streams one by one (a special requirement in my project).

Anyway, your answer gave me an idea for detecting the issue: I can compare the tokens from the individual streams with the tokens from pdPage.getContents(). It means processing everything twice, but it is still useful.
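
The comparison could look roughly like the sketch below. It is only an illustration under stated assumptions: tokens are represented by their printed form (as in the "Token: " + token output above), and the class TokenComparison and its method firstMismatch are hypothetical names, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox. */
final class TokenComparison {

    /**
     * Flattens the per-stream token lists and compares them with the tokens
     * obtained from the combined content.
     * Returns the index of the first differing token, or -1 if both match.
     */
    static int firstMismatch(List<List<String>> perStream, List<String> combined) {
        List<String> flat = new ArrayList<>();
        for (List<String> tokens : perStream) {
            flat.addAll(tokens);
        }
        int common = Math.min(flat.size(), combined.size());
        for (int i = 0; i < common; i++) {
            if (!flat.get(i).equals(combined.get(i))) {
                return i;
            }
        }
        // Same prefix: equal only if the lengths agree as well.
        return flat.size() == combined.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Shortened version of the token lists shown in this thread.
        List<List<String>> perStream = Arrays.asList(
                Arrays.asList("PDFOperator{Td}", "COSArray{[]}"),
                Arrays.asList("COSString{CD}", "COSNull{}", "PDFOperator{TJ}", "PDFOperator{ET}"));
        List<String> combined = Arrays.asList(
                "PDFOperator{Td}", "COSArray{[COSString{CD}]}", "PDFOperator{TJ}", "PDFOperator{ET}");
        System.out.println(firstMismatch(perStream, combined)); // prints 1
    }
}
```

A result other than -1 would indicate that some construct crosses a stream boundary, so the per-stream tokens should not be written back as they are.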

Any other ideas are welcome.

Esteban

Re: Stream parsing issue in multi-stream page

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,



Instead of using pdPage.getContentStreams() and parsing each stream individually, use pdPage.getContents() and read all of the content into a byte[]. You can then pass that to PDFStreamParser.

That will give you this output:

Token: COSName{F1}
Token: COSInt{10}
Token: PDFOperator{Tf}
Token: PDFOperator{BT}
Token: COSInt{40}
Token: COSFloat{764.138}
Token: PDFOperator{Td}
Token: COSInt{0}
Token: COSFloat{-12.138}
Token: PDFOperator{Td}
Token: COSArray{[COSString{CD}]}
Token: PDFOperator{TJ}
Token: PDFOperator{ET}
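
To illustrate why the combined content parses cleanly, here is a small sketch that joins the two example streams with plain byte arrays. It does not use PDFBox; the class JoinStreams and its method join are hypothetical names, and the newline written after each part is an assumption made here so that tokens at a boundary cannot run together. In PDFBox itself the combined data would come from pdPage.getContents() as described above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox. */
final class JoinStreams {

    /** Concatenates the given stream contents in order, writing a newline after each part. */
    static byte[] join(List<byte[]> streams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] stream : streams) {
            out.write(stream, 0, stream.length);
            out.write('\n'); // whitespace separator between the parts
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = "/F1 10 Tf\nBT\n40 764.138 Td\n0 -12.138 Td\n[\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] second = "(CD) ] TJ\nET\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] all = join(Arrays.asList(first, second));
        // The array now opens and closes within the same data: [ (CD) ] TJ
        System.out.println(new String(all, StandardCharsets.ISO_8859_1));
    }
}
```

After joining, the "[" from the first stream and the "]" from the second one belong to the same data, so a parser can see the array as a single token.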

BR
Maruan




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org