Posted to user@nutch.apache.org by Rolando Bermudez Peña <rb...@uci.cu> on 2009/01/28 07:00:53 UTC

error fetching pdf

Hello all,

When crawling my intranet I encounter several errors like the following.


fetch of http://intranet/pdf/fund_ admin_fin_2.pdf failed with: java.lang.IllegalArgumentException: Invalid uri 
'http://intranet/pdf/fund_ admin_fin_2.pdf': escaped absolute path not valid


Error parsing: http://intranet/pdf/infotech.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption

Error parsing: http://intranet/pdf/ronda_pupo.pdf: failed(2,0): Can't be handled as pdf document. java.io.IOException: Error: expected the end of a dictionary.


Any idea what is causing this? Could it be a bad configuration?

Regards,
Rolando

 

Re: error fetching pdf

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Jan 28, 2009 at 8:00 AM, Rolando Bermudez Peña <rb...@uci.cu> wrote:
> Hello all,
>
> When crawling my intranet I encounter several errors like the following.
>
>
> fetch of http://intranet/pdf/fund_ admin_fin_2.pdf failed with: java.lang.IllegalArgumentException: Invalid uri
> 'http://intranet/pdf/fund_ admin_fin_2.pdf': escaped absolute path not valid
>

This URL contains a space, and I think Nutch rejects it as an invalid URL for that reason.
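
To illustrate the point about the space, here is a minimal Java sketch of the usual workaround: percent-encoding the space as %20 before the URL is fetched. The class and method names (UrlSpaceEncoder, encodeSpaces, isParsable) are made up for this example and are not part of Nutch, and where such a step would hook into a crawl is left open. It only uses java.net.URI from the standard library to show that the raw URL fails strict parsing while the encoded one passes.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a Nutch class: shows why a URL with a literal
// space is rejected and how percent-encoding the space makes it parsable.
class UrlSpaceEncoder {

    // Replace each literal space with its percent-encoded form.
    // Assumption: spaces are the only illegal characters in the URL.
    static String encodeSpaces(String rawUrl) {
        return rawUrl.trim().replace(" ", "%20");
    }

    // True if java.net.URI accepts the string as a syntactically valid URI.
    static boolean isParsable(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "http://intranet/pdf/fund_ admin_fin_2.pdf";
        String encoded = encodeSpaces(raw);

        System.out.println(raw + " parsable: " + isParsable(raw));
        System.out.println(encoded + " parsable: " + isParsable(encoded));
    }
}
```

Whether the server then serves the file under the encoded path depends on the server; the sketch only covers the client-side URL syntax.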

>
> Error parsing: http://intranet/pdf/infotech.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
>
> Error parsing: http://intranet/pdf/ronda_pupo.pdf: failed(2,0): Can't be handled as pdf document. java.io.IOException: Error: expected the end of a dictionary.
>
>
> Any idea what is causing this? Could it be a bad configuration?
>
> Regards,
> Rolando
>
>
>



-- 
Doğacan Güney