Posted to java-user@lucene.apache.org by Shoba Ramachandran <sh...@yahoo.com> on 2003/05/16 17:31:57 UTC

Indexing encrypted PDF documents using PDFBox-0.6.1

Hi,

Has anybody successfully indexed encrypted PDF
documents?

I get a NullPointerException at

decryptor.decryptDocument( "" );

Thanks
Shoba

Code:
--------
public static Document pdfDocument(Document document, File file) throws Exception
{
    PDDocument pdDocument = null;
    try
    {
        PDFParser parser = new PDFParser(new FileInputStream(file));
        parser.parse();

        pdDocument = parser.getPDDocument();
        System.out.println("pdDocument :  " + pdDocument);

        if( pdDocument.isEncrypted() )
        {
            DecryptDocument decryptor = new DecryptDocument( pdDocument );
            System.out.println("decryptor :  " + decryptor);
            //Just try using the default password and move on
            decryptor.decryptDocument( "" );
        }

        //create a tmp output stream to hold the extracted text.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        OutputStreamWriter writer = new OutputStreamWriter( out );
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.writeText( pdDocument.getDocument(), writer );
        writer.close();

        byte[] contents = out.toByteArray();
        out.close();
        InputStreamReader input = new InputStreamReader( new ByteArrayInputStream( contents ) );

        // Add the extracted text as a Reader-valued Text field so it will
        // get tokenized and indexed.
        document.add(Field.Text("Contents", input ));

        int summarySize = Math.min( contents.length, 200 );
        // Add the summary as an UnIndexed field, so that it is stored and returned
        // with hit documents for display.
        //System.out.println(" ************** PDF summary : " + new String( contents, 0, summarySize ));
        document.add(Field.UnIndexed("Summary", new String( contents, 0, summarySize ) ) );

        //add the properties
        //addProperties(document, pdDocument);
    }
    catch( CryptographyException e )
    {
        throw new IOException("Error decrypting document(" + file.getPath() + "): " + e.getMessage());
    }
    catch( InvalidPasswordException e )
    {
        //they didn't supply a password and the default of "" was wrong.
        throw new IOException("The document(" + file.getPath() + ") is encrypted and will not be indexed.");
    }
    finally
    {
        if(pdDocument!=null) pdDocument.close();
    }
    return document;
}
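
A side note on the buffering step in the code above: it writes the extracted text to bytes and reads it back with the platform default charset, and it cuts the summary at 200 bytes rather than 200 characters. The following is only a sketch, using nothing but java.io, of keeping the text as characters throughout. ExtractedText, summaryOf and readerOf are hypothetical names, and it assumes the stripper's writeText call accepts any java.io.Writer, which the posted code suggests but does not prove.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

/** Hypothetical helper, not part of PDFBox or Lucene. */
final class ExtractedText {
    static final int SUMMARY_CHARS = 200;

    private ExtractedText() {}

    /** First SUMMARY_CHARS characters of the text, or the whole text if it is shorter. */
    static String summaryOf(String text) {
        return text.substring(0, Math.min(text.length(), SUMMARY_CHARS));
    }

    /** A Reader over the same text, suitable for a Reader-valued field. */
    static Reader readerOf(String text) {
        return new StringReader(text);
    }

    public static void main(String[] args) throws IOException {
        // In the real method the stripper would write into this buffer
        // (assumption: writeText takes a java.io.Writer).
        StringWriter buffer = new StringWriter();
        buffer.write("Text extracted from the PDF would be written here.");

        String text = buffer.toString();
        System.out.println("summary: " + summaryOf(text));

        // Read back one character to show that the Reader is usable.
        Reader reader = readerOf(text);
        System.out.println("first char: " + (char) reader.read());
        reader.close();
    }
}
```

This does not address the NullPointerException itself; it only removes the byte and charset round trip from the indexing path.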


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Indexing encrypted PDF documents using PDFBox-0.6.1

Posted by Ben Litchfield <be...@csh.rit.edu>.
This seems to be more of a PDFBox issue than a Lucene issue.  Please post
the stack trace on the PDFBox mailing list.  Also, 0.6.2 is available, which
fixes some bugs.

http://www.sourceforge.net/projects/pdfbox
http://www.pdfbox.org
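
For posting the stack trace, one way to capture it as text is sketched below using only java.io. StackTraceText is a hypothetical helper name, not something from PDFBox or Lucene, and the null reference in main merely stands in for whatever is null in the real indexing code.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical helper, not part of PDFBox or Lucene. */
final class StackTraceText {
    private StackTraceText() {}

    /** Returns the full stack trace of t, including any cause chain, as a String. */
    static String of(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        t.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        try {
            Object decryptor = null; // stand-in for the null reference in the real code
            decryptor.toString();    // throws NullPointerException
        } catch (NullPointerException e) {
            // Print the trace so it can be pasted into a mailing list post.
            System.out.println(StackTraceText.of(e));
        }
    }
}
```

In the indexing method, the same call could be made from a catch block around decryptor.decryptDocument( "" ) so that the frames inside PDFBox are included in the report.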

Ben

