Posted to users@pdfbox.apache.org by ha...@vkb.de on 2015/12/21 16:02:45 UTC

Using PDFStreamParser

Hi!
 
I am seeing very strange behaviour when copying a file with PDFBox's 
PDFStreamParser (RC2).

I modified RemoveAllText so that it does not remove any text:

    public static void main( String[] args ) throws IOException
    {
        // ...
        PDDocument document = null;
        try
        {
            document = PDDocument.load( new File(args[0]) );
            if( document.isEncrypted() )
            {
                System.err.println( "Error: Encrypted documents are not supported for this example." );
                System.exit( 1 );
            }
            for( PDPage page : document.getPages() )
            {
                // Parse the page's content stream into tokens.
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                // Copy every token unchanged (nothing is filtered out).
                List<Object> newTokens = new ArrayList<Object>();
                for (Object token : tokens)
                {
                    newTokens.add( token );
                }
                // Write the tokens to a new Flate-compressed stream
                // and use it as the page contents.
                PDStream newContents = new PDStream( document );
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter( out );
                writer.writeTokens( newTokens );
                out.close();
                page.setContents( newContents );
            }
            document.save( args[1] );
        }
        finally
        {
            if( document != null )
            {
                document.close();
            }
        }
    }

I opened both PDFs with PDFDebugger, and the Contents text view is identical 
for both files (see the second TJ!). In the hex view there are differences in 
space (20) and LF (0A) characters, where line endings seem to have been 
inserted or replaced.

BT
  0 0 0 1 k
  /T1_0 1 Tf
  10 0 0 10 32.4181 265.8897 Tm
  [ (\037\036\035\034\033\032\031\030\027) -28 
(\026\025\035\024\023\022\025\031\031\030\035\021) ] TJ
  /T1_1 1 Tf
  9.8 0 0 10 32.4181 253.8897 Tm
  [ (\037\036\035\034\033\032\031\030\027\026\025\024) -53 (\023\022\024) 
-53 (\021\020\017\016\024) -53 (\015\023\014\013\012\011\024) -53 
(\010\030\027\026\025\024) -53 (\015\007\020\017\016\024) -53 
(\015\011\024) -53 (\006\025\033\005\025\004\026\003\025\002\026\024) -53 
(\002\001\027\024) -53 (\177\004\025\024) -53 ... TJ
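
The space/LF differences seen in the hex view are normally harmless: outside of strings, PDF treats space (20), LF (0A), CR (0D) and the other white-space bytes alike as token separators. The following is a minimal sketch of how two content streams could be compared while ignoring such differences. The class and method names are hypothetical and not part of PDFBox, and the sketch assumes streams without comments or inline images. It only ignores the kind and amount of white space between tokens, not its presence or absence.

```java
import java.nio.charset.StandardCharsets;

public class ContentStreamWhitespace {

    // The six white-space bytes defined by the PDF specification.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Collapses every run of white space outside literal strings to one space.
    // Bytes inside ( ... ) are copied unchanged, because there they are data.
    static String normalize(byte[] stream) {
        StringBuilder out = new StringBuilder();
        int depth = 0;                // nesting depth of literal-string parentheses
        boolean pendingSpace = false; // a separator is owed before the next token
        for (int i = 0; i < stream.length; i++) {
            int b = stream[i] & 0xFF;
            if (depth > 0) {
                out.append((char) b);
                if (b == '\\' && i + 1 < stream.length) {
                    // Copy the escaped byte as-is so "\)" does not close the string.
                    out.append((char) (stream[++i] & 0xFF));
                } else if (b == '(') {
                    depth++;
                } else if (b == ')') {
                    depth--;
                }
            } else if (isPdfWhitespace(b)) {
                pendingSpace = out.length() > 0; // drop leading white space
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append((char) b);
                if (b == '(') {
                    depth = 1;
                }
            }
        }
        return out.toString();
    }

    static boolean sameTokens(byte[] a, byte[] b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        byte[] original = "/T1_0 1 Tf\n[(a b) -28 (c)] TJ"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] rewritten = "/T1_0 1 Tf [(a b)\n-28 (c)]\nTJ\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sameTokens(original, rewritten)); // prints "true"
    }
}
```

A change of white space inside a string, for example (a b) versus a string with a line break between a and b, is deliberately reported as a difference, because bytes inside the parentheses are part of the text being shown.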

Consequently the preview in PDFDebugger (page two!) is the same too.

Übungskarte 49 (INT 1463), Karte 1/INT 1, Begleitheft für die 
Kartenaufgaben im Fach Navigation für den SKS (Ausgabe 2013)



But when I open the new PDF file with Adobe Reader 11.0.10.32, the text 
has changed! Each 1 is now a ), but not the one in 2013!

Übungskarte 49 (INT )463), Karte )/INT ), Begleitheft für die 
Kartenaufgaben im Fach Navigation für den SKS (Ausgabe 2013)

On page three, "Aufgabe" is now "Auf0abe".

I have no idea how this can happen. Is there relevant information anywhere 
other than in the TJ block? The file sizes differ slightly (old 960 K, 
new 1041 K, for 81 pages).
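
On the question of where else information lives: the strings in a TJ array are only byte codes. Which glyph each code shows is decided by the font selected with Tf (here /T1_0 and /T1_1), which is stored in the page resources rather than in the content stream. The sketch below decodes the octal escapes of one of the quoted strings to make the codes visible. The class and method names are hypothetical, and only octal escapes and plain characters are handled.

```java
import java.util.ArrayList;
import java.util.List;

public class LiteralStringCodes {

    static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    // Decodes the body of a PDF literal string (without the outer parentheses)
    // into character codes. A backslash followed by one to three octal digits
    // yields one code; every other character is taken literally.
    static List<Integer> decode(String body) {
        List<Integer> codes = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length() && isOctal(body.charAt(i + 1))) {
                int value = 0;
                int digits = 0;
                i++; // skip the backslash
                while (i < body.length() && digits < 3 && isOctal(body.charAt(i))) {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                codes.add(value & 0xFF); // high-order overflow is ignored
            } else {
                codes.add((int) c);
                i++;
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // One of the strings from the second TJ array quoted above.
        System.out.println(decode("\\015\\011\\024")); // prints "[13, 9, 20]"
    }
}
```

The codes in this document fall into the control-character range, and they include 13 (octal 015) and 10 (octal 012), which are the same byte values as CR and LF. That makes such strings sensitive to any tool that treats those bytes as line endings. Whether that is what happened here is not established in this thread.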

This is the PDF: 
https://www.elwis.de/Freizeitschifffahrt/fuehrerscheininformationen/Navigationsaufgaben-SKS.pdf


Thanks

Hans Stemmer

Reply: Re: Using PDFStreamParser

Posted by ha...@vkb.de.

With RC2 the text is displayed incorrectly (0 instead of g, and ) instead 
of 1). But with the latest build (pdfbox-app-2.0.0-20151221.122720-1872) 
everything is fine. Thanks a lot.

Kind regards
Hans




From:    Tilman Hausherr <TH...@t-online.de>
To:      users@pdfbox.apache.org
Date:    21.12.2015 19:08
Subject: Re: Using PDFStreamParser



Could you retry with the current version? Either get -SNAPSHOT through 
maven, or from
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/


I can't reproduce what you mean (I tested with the trunk), so either I 
missed it, or (what I suspect) it is a bug that I fixed a short time ago 
(PDFBOX-3107). However I'm also unable to reproduce it with RC2 and RC1.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Using PDFStreamParser

Posted by Tilman Hausherr <TH...@t-online.de>.

Could you retry with the current version? Either get -SNAPSHOT through 
maven, or from
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

I can't reproduce what you mean (I tested with the trunk), so either I 
missed it, or (what I suspect) it is a bug that I fixed a short time ago 
(PDFBOX-3107). However I'm also unable to reproduce it with RC2 and RC1.

Tilman
