Posted to java-user@lucene.apache.org by Shoba Ramachandran <sh...@yahoo.com> on 2003/05/16 17:31:57 UTC
Indexing encrypted PDF documents using PDFBox-0.6.1
Hi,
Has anybody successfully indexed encrypted PDF documents?
I get a NullPointerException at
decryptor.decryptDocument( "" );
Thanks
Shoba
Code:
--------
public static Document pdfDocument(Document document, File file) throws Exception
{
    PDDocument pdDocument = null;
    try
    {
        PDFParser parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        pdDocument = parser.getPDDocument();
        System.out.println("pdDocument : " + pdDocument);
        if( pdDocument.isEncrypted() )
        {
            DecryptDocument decryptor = new DecryptDocument( pdDocument );
            System.out.println("decryptor : " + decryptor);
            // Just try using the default password and move on.
            decryptor.decryptDocument( "" );
        }
        // Create an in-memory output stream to hold the extracted text.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        OutputStreamWriter writer = new OutputStreamWriter( out );
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.writeText( pdDocument.getDocument(), writer );
        writer.close();
        byte[] contents = out.toByteArray();
        out.close();
        InputStreamReader input = new InputStreamReader( new ByteArrayInputStream( contents ) );
        // Add the extracted contents as a Reader-valued Text field so it will
        // get tokenized and indexed.
        document.add(Field.Text("Contents", input ));
        int summarySize = Math.min( contents.length, 200 );
        // Add the summary as an UnIndexed field, so that it is stored and returned
        // with hit documents for display.
        //System.out.println(" ************** PDF summary : " + new String( contents, 0, summarySize ));
        document.add(Field.UnIndexed("Summary", new String( contents, 0, summarySize ) ) );
        // Add the properties.
        //addProperties(document, pdDocument);
    }
    catch( CryptographyException e )
    {
        throw new IOException("Error decrypting document(" + file.getPath() + "): " + e.getMessage());
    }
    catch( InvalidPasswordException e )
    {
        // They didn't supply a password and the default of "" was wrong.
        throw new IOException("The document(" + file.getPath() + ") is encrypted and will not be indexed.");
    }
    finally
    {
        if(pdDocument!=null) pdDocument.close();
    }
    return document;
}
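A side note on the summary step in the code above, unrelated to the NullPointerException: the summary is built from the first 200 bytes, which can cut a multi-byte character in half, and the bytes are written and read back using the platform default charset. Below is a minimal sketch of one alternative, assuming the extracted text is first collected as a String (for example by handing the stripper a StringWriter instead of an OutputStreamWriter). The class and method names are hypothetical and are not part of PDFBox or Lucene.

```java
import java.io.Reader;
import java.io.StringReader;

class SummarySketch {
    // Hypothetical helper: returns at most maxChars characters of text,
    // counting characters rather than bytes.
    static String summarize(String text, int maxChars) {
        int end = Math.min(text.length(), maxChars);
        // Do not end the summary on the first half of a surrogate pair.
        if (end > 0 && end < text.length()
                && Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    // Hypothetical helper: a Reader over the full text, usable wherever a
    // Reader-valued field is expected.
    static Reader contentReader(String text) {
        return new StringReader(text);
    }

    public static void main(String[] args) {
        String text = "Example extracted text";
        System.out.println(summarize(text, 7)); // prints "Example"
    }
}
```

With this approach the "Summary" field would take summarize(text, 200) and the "Contents" field would take contentReader(text), so no byte-to-character conversion is needed.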
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Indexing encrypted PDF documents using PDFBox-0.6.1
Posted by Ben Litchfield <be...@csh.rit.edu>.
This seems to be more of a PDFBox issue than a Lucene issue. Please post
the stack trace on the PDFBox mailing list. Also, PDFBox 0.6.2 is available,
which fixes some bugs.
http://www.sourceforge.net/projects/pdfbox
http://www.pdfbox.org
Ben
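For anyone who needs to capture the stack trace requested above, here is a minimal sketch using only the standard library. The class and method names are hypothetical, and the null dereference in main is just a stand-in for the failing decryptor.decryptDocument( "" ) call, not a reproduction of the PDFBox problem.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class StackTraceSketch {
    // Renders the full stack trace of t, including any causes, as a String
    // that can be pasted into a mailing list post.
    static String stackTraceOf(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        t.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for the failing call; the real code would call
            // decryptor.decryptDocument( "" ) here.
            Object decryptor = null;
            decryptor.toString();
        } catch (NullPointerException e) {
            System.err.println(stackTraceOf(e));
        }
    }
}
```

The top frames of the trace show whether the exception is thrown inside PDFBox or in the calling code, which is the detail the PDFBox list would need.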