Posted to users@pdfbox.apache.org by Luis Perruca <l....@gmail.com> on 2009/11/02 16:40:57 UTC
Problems with parse and split

Hi All.

I have a problem while splitting a PDF.

The scenario:
A single PDF document with about 700 pages, each one containing the payroll 
of one employee.
My goal is to get one PDF per page, and the name of each file must 
contain the employee id, the month and the year.
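
For the file naming part, a minimal sketch in plain Java of how such a name could be built. The class and method names (PayrollFileNames, sanitize, build) are hypothetical and not part of PDFBox, and the fallback to a page-number name when a field is missing is an assumption, added so that pages with no extracted id do not overwrite each other:

```java
final class PayrollFileNames {
    private PayrollFileNames() {}

    /** Trims the value and replaces runs of characters that are unsafe in file names with '_'. */
    static String sanitize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.trim().replaceAll("[^A-Za-z0-9._-]+", "_");
    }

    /**
     * Builds "<employeeId>-<monthYear>.pdf".
     * Assumption: if either field is missing, fall back to "page-<n>.pdf" (1-based).
     */
    static String build(String employeeId, String monthYear, int pageIndex) {
        String id = sanitize(employeeId);
        String period = sanitize(monthYear);
        if (id.isEmpty() || period.isEmpty()) {
            return "page-" + (pageIndex + 1) + ".pdf";
        }
        return id + "-" + period + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(build("12345678Z", "10/2009", 0));
        System.out.println(build("", "10/2009", 4));
    }
}
```

In the loop of List 1, the name passed to save() would then be something like PayrollFileNames.build(NIF, monthYear, i).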

My first idea was to parse the original document, iterating over each page, 
extracting the info and saving each page as its own document.
The code in List 1 below works, but if the original 
file is about 7 MB, each of the resulting single-page files is about 3.5 MB.

How is this possible? Am I doing something wrong? Any suggestions?

Thanks.

List 1

        PDDocument doc=null;
        try
        {
            doc = PDDocument.load(inputFile, false);
            List pages = doc.getDocumentCatalog().getAllPages();
         

            for( int i=0; i<pages.size(); i++ )
            {

                String NIF="";
                String monthYear="";
              
                PDPage page = (PDPage)pages.get( i );
                PDStream contents = page.getContents();
          
                PDFStreamParser parser = new PDFStreamParser( contents.getStream() );
                parser.parse();
                List tokens = parser.getTokens();
               
                DocumentInfo documentInfo=new DocumentInfo();
               
               
                for( int j=0; j<tokens.size(); j++ ) {
                    Object next = tokens.get( j );
                    if( next instanceof PDFOperator )
                    {
                        PDFOperator op = (PDFOperator)next;
                        //Tj and TJ are the two operators that display
                        //strings in a PDF
                        if( op.getOperation().equals( "TJ" ) || op.getOperation().equals( "Tj" ) )
                        {
                            //Tj takes one operand, the string to display,
                            //so read the token just before the operator
                            COSString previous = (COSString)tokens.get( j-1 );
                            String string = previous.getString();
                           
                          
//                            System.out.println(string); // From string I get the info

                            ......Some code to obtain the info from the page ........
                            if(... the string is the id)
                               NIF=string;
                            if(.... the string is the month-Year)
                               monthYear=string;

                        }
                    }
                   
                } //for
               
                PDDocument docPagina=null;
                try {
                    docPagina=new PDDocument();
                   
                    docPagina.addPage(page);
                   
                    docPagina.save(NIF + "-" + monthYear + ".pdf");
                }
                finally {
                    if(docPagina!=null) {
                        docPagina.close();
                    }
                }
               
            }
        }
        finally
        {
            if( doc != null )
            {
                doc.close();
                File fichero=new File(inputFile);
                fichero.delete();   //Delete the original File.
            }
        }