Posted to dev@pdfbox.apache.org by "Mel Martinez (JIRA)" <ji...@apache.org> on 2010/01/07 00:34:54 UTC

[jira] Updated: (PDFBOX-591) PDFBox performance issue: BaseParser.readUntilEndStream() rewrite

     [ https://issues.apache.org/jira/browse/PDFBOX-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-591:
--------------------------------

    Attachment: BaseParser.java

Tweaked version of BaseParser to improve the performance of the readUntilEndStream() method.

> PDFBox performance issue:  BaseParser.readUntilEndStream() rewrite
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-591
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-591
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: BaseParser.java
>
>
> Loading documents into PDFBox (PDDocument) is too slow.
> One culprit is the method: org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out)
> The current implementation of this method uses a very slow test for end-of-stream conditions. A profile of readUntilEndStream() shows that a large share of the method's processing time is consumed in the cmpCircularBuffer() call, which serves only to test for the end-of-stream marker. In other words, readUntilEndStream() spends twice as much time testing for the end-of-stream marker as it does reading bytes from the stream.
> A better solution is a simpler, direct, fail-fast conditional structure that compares primitive byte values. I strongly recommend replacing the current method with the code below. This speeds up readUntilEndStream() by a little over a factor of 3 (113/37 = 3.05, to be more precise). This in turn improves the overall performance of PDDocument.parse() by about a factor of 2.7.
> Note the added constants (the ASCII codes of the marker letters), which keep the code readable.
> -----------------------------------------------------------------
>     private static final int E = 101;
>     private static final int N = 110;
>     private static final int D = 100;
>     
>     private static final int S = 115;
>     private static final int T = 116;
>     private static final int R = 114;
>     private static final int A = 97;
>     private static final int M = 109;
>     
>     private static final int O = 111;
>     private static final int B = 98;
>     private static final int J = 106;
>     
>     /**
>      * This method will read through the current stream object until
>      * we find the keyword "endstream" meaning we're at the end of this
>      * object. Some pdf files, however, forget to write some endstream tags
>      * and just close off objects with an "endobj" tag so we have to handle
>      * this case as well.
>      * @param out The stream we write out to. 
>      * @throws IOException
>      */
>     private void readUntilEndStream( OutputStream out ) throws IOException{
>     	int byteRead;
>     	do{ //use a fail fast test for end of stream markers
>     		byteRead = pdfSource.read();
>     		if(byteRead==E){//only branch if "e"
>     			byteRead = pdfSource.read();
>     			if(byteRead==N){ //only continue branch if "en"
>     				byteRead = pdfSource.read();
>     				if(byteRead==D){//up to "end" now
>     					byteRead = pdfSource.read();
>     					if(byteRead==S){
>     						byteRead = pdfSource.read();
>     						if(byteRead==T){
>     							byteRead = pdfSource.read();
>     							if(byteRead==R){
>     								byteRead = pdfSource.read();
>     								if(byteRead==E){
>     									byteRead = pdfSource.read();
>     									if(byteRead==A){
>     										byteRead = pdfSource.read();
>     										if(byteRead==M){
>     											//found the whole marker
>     											pdfSource.unread( ENDSTREAM );
>     							                return;
>     										}else{
>     											out.write(ENDSTREAM, 0, 8); //"endstrea" was consumed, so emit it
>     										}
>     									}else{
>     										out.write(ENDSTREAM, 0, 7);
>     									}
>     								}else{
>     									out.write(ENDSTREAM, 0, 6);
>     								}
>     							}else{
>     								out.write(ENDSTREAM, 0, 5);
>     							}
>     						}else{
>         						out.write(ENDSTREAM, 0, 4);
>     						}
>     					}else if(byteRead==O){
>     						byteRead = pdfSource.read();
>     						if(byteRead==B){
>     							byteRead = pdfSource.read();
>     							if(byteRead==J){
>     								//found whole marker
>     								pdfSource.unread( ENDOBJ );
>     				                return;
>     							}else{
>     								out.write(ENDOBJ, 0, 5); //"endob" was consumed, so emit it
>     							}
>     						}else{
>         						out.write(ENDOBJ, 0, 4);
>     						}
>     					}else{
>     						out.write(E);
>     						out.write(N);
>     						out.write(D);
>     					}
>     				}else{
>     					out.write(E);
>     					out.write(N);
>     				}
>     			}else{
>     				out.write(E);
>     			}
>     		}
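>     		if(byteRead==E){
>     			pdfSource.unread(byteRead); //a mismatched "e" may start a new marker, so re-examine it
>     		}else if(byteRead!=-1){
>     			out.write(byteRead);
>     		}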
>     	}while(byteRead!=-1);
>     }
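
The fail-fast idea in the quoted method can also be tried outside PDFBox. Below is a minimal, self-contained sketch, not the attached BaseParser.java. The names MarkerScanDemo, copyUntilMarker and startsWith are hypothetical and chosen for illustration only. It assumes a java.io.PushbackInputStream with a pushback buffer of at least 9 bytes, and it restarts matching at the byte after a failed candidate, so a marker that overlaps a partial match (as in "endstrendstream") is still found.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
public class MarkerScanDemo {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    // Copies bytes from in to out until "endstream" or "endobj" is next in
    // the stream. The marker itself is left unread in the stream.
    static void copyUntilMarker(PushbackInputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b == 'e') { // only bytes equal to 'e' pay for a marker test
                in.unread(b);
                if (startsWith(in, ENDSTREAM) || startsWith(in, ENDOBJ)) {
                    return;
                }
                b = in.read(); // not a marker: consume the 'e' again
            }
            out.write(b);
        }
    }

    // Returns true if the next bytes equal marker. Every byte read here is
    // pushed back, so the stream position does not change.
    static boolean startsWith(PushbackInputStream in, byte[] marker) throws IOException {
        int matched = 0;
        while (matched < marker.length) {
            int b = in.read();
            if (b != marker[matched]) { // fail fast; also taken at EOF (b == -1)
                if (b != -1) {
                    in.unread(b);
                }
                in.unread(marker, 0, matched); // matched bytes equal the marker prefix
                return false;
            }
            matched++;
        }
        in.unread(marker, 0, marker.length);
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abc eendstrendstream tail".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 16);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyUntilMarker(in, out);
        // prints "abc eendstr"
        System.out.println(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

This version trades some speed for simplicity: a candidate that fails the "endstream" test is read a second time for the "endobj" test, whereas the nested conditionals in the quoted method share the common "end" prefix.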
