Posted to users@pdfbox.apache.org by varun bhansaly <vb...@gmail.com> on 2011/03/15 08:13:53 UTC

Unable to convert valid pdf to html

Hi,
I encountered an exception while converting a PDF to HTML/text using
pdfbox-app-1.5.0.
The file in this case is "team21_devel.pdf"; please note that this is a valid PDF,
as it opens in Adobe Reader.

I ran the command-line utility as follows:
java -jar pdfbox-app-1.5.0.jar ExtractText -html team21_devel.pdf
The exception:
ExtractText failed with the following exception:
java.io.IOException: Expected='null' actual='nullnullnull'
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1025)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:802)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1011)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:179)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:292)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1000)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:533)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:180)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:881)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:771)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:179)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)

Let me know if any other information is required.
If someone has a solution, please share it.

-- 
Regards,
Varun Bhansaly

Re: Unable to convert valid pdf to html

Posted by varun bhansaly <vb...@gmail.com>.
Hi Thomas,
It opens fine with Acrobat X on Win7 x86_64, Acrobat 9 on Ubuntu x86_64, and
Evince on Ubuntu x86_64.

I was actually more surprised by the error message in the trace: "java.io.IOException:
Expected='null' actual='nullnullnull'".
Anyway, thanks for looking into it.
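
One possible reading of that message, offered as an assumption rather than a confirmed diagnosis: the trace passes through parseCOSArray into parseDirObject, so the parser was probably reading an array element, expected the keyword null, and instead read the run "nullnullnull" as a single token. That would happen if the file writes consecutive null entries with no whitespace between them. The sketch below is not PDFBox code; readToken and parseNull are hypothetical helpers that only mimic that behaviour.

```java
import java.io.IOException;

public class NullTokenSketch {
    // Hypothetical helper: reads characters up to the next whitespace or
    // array delimiter, the way a simple tokenizer might.
    public static String readToken(String input, int start) {
        int end = start;
        while (end < input.length()) {
            char c = input.charAt(end);
            if (Character.isWhitespace(c) || c == '[' || c == ']') {
                break;
            }
            end++;
        }
        return input.substring(start, end);
    }

    // Hypothetical helper: accepts only the exact keyword "null" and
    // otherwise reports what it read, in the style of the message above.
    public static void parseNull(String input, int start) throws IOException {
        String token = readToken(input, start);
        if (!token.equals("null")) {
            throw new IOException("Expected='null' actual='" + token + "'");
        }
    }

    public static void main(String[] args) {
        String[] samples = { "[null null null]", "[nullnullnull]" };
        for (String sample : samples) {
            try {
                parseNull(sample, 1); // index 1 is just past the opening '['
                System.out.println(sample + " -> first element parsed as null");
            } catch (IOException e) {
                System.out.println(sample + " -> " + e.getMessage());
            }
        }
    }
}
```

If this reading is right, a more lenient viewer could still display the file by splitting the run into separate null entries, which would be consistent with it opening elsewhere.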

On Wed, Mar 16, 2011 at 1:10 PM, Thomas Fischer <fi...@aon.at> wrote:

> Hi Varun,
>
> whatever it is, there is something wrong with this file.
> On my Mac, Acrobat Reader 6, Preview and Skim can't open the file, and
> Acrobat Reader 9 starts with the message
> "The document is damaged but will be repaired." (translated from German)
> JHOVE claims it is well-formed and valid.
> But anyway, I think the (long term) goal is that PDFBox should be able to
> read everything that Acrobat Reader can read (not only display).
>
> Best
> Thomas
>
> On 16.03.2011 at 02:31, varun bhansaly wrote:
>
> > Hi Thomas,
> > Thanks for the reply, have created a JIRA issue
> > https://issues.apache.org/jira/browse/PDFBOX-982
> >
> > On Wed, Mar 16, 2011 at 3:38 AM, Thomas Fischer <fi...@aon.at> wrote:
> >
> >> Hello Varun,
> >>
> >> I can't tell you much about the error, just want to note that
> >>
> >>> The file in this case is "team21_devel.pdf", please note this is a valid PDF
> >>> as it gets opened in adobe reader.
> >>
> >> definitely doesn't guarantee that this is a valid PDF file (as in
> >> "conforming to given standards").
> >> Since it is hard to know what is going on without the file, and this
> >> mailing list doesn't accept attachments, you could either provide a URL for
> >> the file or create an issue at PDFbox's Jira:
> >> https://issues.apache.org/jira/browse/
> >>
> >> Regards
> >> Thomas
> >>
> >>
> >>
> >
> >
> > --
> > Regards,
> > Varun Bhansaly
>
> Kind regards
> Thomas Fischer
>
>
>


-- 
Regards,
Varun Bhansaly

Re: Unable to convert valid pdf to html

Posted by Thomas Fischer <fi...@aon.at>.
Hi Varun,

Whatever it is, there is something wrong with this file.
On my Mac, Acrobat Reader 6, Preview and Skim can't open the file, and Acrobat Reader 9 starts with the message
"The document is damaged but will be repaired." (translated from German).
JHOVE claims it is well-formed and valid.
But anyway, I think the (long-term) goal is that PDFBox should be able to read everything that Acrobat Reader can read (not only display).

Best
Thomas

On 16.03.2011 at 02:31, varun bhansaly wrote:

> Hi Thomas,
> Thanks for the reply, have created a JIRA issue
> https://issues.apache.org/jira/browse/PDFBOX-982
> 
> On Wed, Mar 16, 2011 at 3:38 AM, Thomas Fischer <fi...@aon.at> wrote:
> 
>> Hello Varun,
>> 
>> I can't tell you much about the error, just want to note that
>> 
>>> The file in this case is "team21_devel.pdf", please note this is a valid PDF
>>> as it gets opened in adobe reader.
>> 
>> definitely doesn't guarantee that this is a valid PDF file (as in
>> "conforming to given standards").
>> Since it is hard to know what is going on without the file, and this
>> mailing list doesn't accept attachments, you could either provide a URL for
>> the file or create an issue at PDFbox's Jira:
>> https://issues.apache.org/jira/browse/
>> 
>> Regards
>> Thomas
>> 
>> 
>> 
> 
> 
> -- 
> Regards,
> Varun Bhansaly

Kind regards
Thomas Fischer



Re: Unable to convert valid pdf to html

Posted by varun bhansaly <vb...@gmail.com>.
Hi Thomas,
Thanks for the reply, have created a JIRA issue
https://issues.apache.org/jira/browse/PDFBOX-982

On Wed, Mar 16, 2011 at 3:38 AM, Thomas Fischer <fi...@aon.at> wrote:

> Hello Varun,
>
> I can't tell you much about the error, just want to note that
>
> > The file in this case is "team21_devel.pdf", please note this is a valid PDF
> > as it gets opened in adobe reader.
>
> definitely doesn't guarantee that this is a valid PDF file (as in
> "conforming to given standards").
> Since it is hard to know what is going on without the file, and this
> mailing list doesn't accept attachments, you could either provide a URL for
> the file or create an issue at PDFbox's Jira:
> https://issues.apache.org/jira/browse/
>
> Regards
> Thomas
>
>
>


-- 
Regards,
Varun Bhansaly

Re: Unable to convert valid pdf to html

Posted by Thomas Fischer <fi...@aon.at>.
Hello Varun,

I can't tell you much about the error, just want to note that

> The file in this case is "team21_devel.pdf", please note this is a valid PDF
> as it gets opened in adobe reader.

definitely doesn't guarantee that this is a valid PDF file (as in "conforming to given standards").
Since it is hard to know what is going on without the file, and this mailing list doesn't accept attachments, you could either provide a URL for the file or create an issue at PDFBox's Jira:
https://issues.apache.org/jira/browse/
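
As a rough illustration of why "opens in a viewer" and "conforms to the standard" are different claims, here is a minimal sketch of a structural sanity check. It relies on only two well-known facts about the format: a PDF begins with the marker %PDF- and carries a %%EOF marker near its end. The class name, method names, and the 1024-byte search window are made up for this example, and passing the check does not prove validity either.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfSanityCheck {
    // True if the data starts with the "%PDF-" header marker.
    public static boolean hasHeader(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.ISO_8859_1);
        if (data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    // True if "%%EOF" appears in the last 1024 bytes.
    // The window size is an assumption chosen for this sketch.
    public static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        String tail = new String(data, from, data.length - from,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfSanityCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("header present: " + hasHeader(data));
        System.out.println("EOF marker present: " + hasEofMarker(data));
    }
}
```

A file can pass both checks and still contain objects that a strict parser rejects, so this only rules out the most obvious damage.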

Regards
Thomas