Posted to users@pdfbox.apache.org by Daniel Gibby <dg...@edirectpublishing.com> on 2014/07/01 20:20:25 UTC

IOException should be something more specific?

Using Tika 1.5 (the latest release, which uses PDFBox), I'm seeing the 
following IOException when parsing certain PDFs.

java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:335)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:177)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
...

Should this be something more specific than just an IOException, so that 
Tika can know whether to just let it bubble up as an IOException, or 
encapsulate it into a TikaException?

I don't know enough about the PDFBox project to know if there are ever 
any exceptions besides IOExceptions thrown. Perhaps there could be a 
PDFParseException or something like that when you run into known 
situations. But if IOExceptions only ever happen when you run into known 
situations, then Tika could just know that is the case and wrap any 
IOException from PDFBox into a TikaException.
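
To make the second option concrete, here is a rough sketch of wrapping at 
the boundary. It assumes the bytes are already in memory, so any 
IOException raised during parsing cannot be a real read failure. 
ParseFailure and loadPdf are hypothetical stand-ins for TikaException and 
the PDFBox load call, not the real classes or signatures.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class WrapSketch {
    // Hypothetical stand-in for TikaException; not the real Tika class.
    static class ParseFailure extends Exception {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical stand-in for the PDFBox load call. It only checks the
    // header and fails with the same message as the trace above.
    static void loadPdf(byte[] data) throws IOException {
        boolean hasHeader = data.length >= 5
                && new String(data, 0, 5, StandardCharsets.US_ASCII).equals("%PDF-");
        if (!hasHeader) {
            throw new IOException("Error: Header doesn't contain versioninfo");
        }
    }

    // The input is already in memory, so an IOException here is treated as
    // a parse problem and wrapped, keeping the original as the cause.
    static void parse(byte[] data) throws ParseFailure {
        try {
            loadPdf(data);
        } catch (IOException e) {
            throw new ParseFailure("PDF could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        try {
            parse("not a pdf".getBytes(StandardCharsets.US_ASCII));
        } catch (ParseFailure e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
        }
    }
}
```

The weak point is the assumption itself: if the parser reads from a stream 
rather than from memory, a genuine read error would be wrapped too.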

What do you think?

Thanks,
Daniel Gibby

Re: IOException should be something more specific?

Posted by Daniel Gibby <dg...@edirectpublishing.com>.

Yes, having more specific exceptions is usually helpful, as it allows 
the code to handle various cases differently.
Otherwise, would there ever be a need for any type of exception except 
the base Exception class?

In this case, having a PDFParseException, or at least a ParseException 
provided by PDFBox instead of just an IOException would tell us that 
there is no problem with the file itself or the input and output of the 
file into the parser, but that something went wrong with the parsing.

When it becomes a TikaException instead of an IOException is when it 
becomes the most useful, because that then allows my software to 
distinguish between an event caused by parsing versus some general 
problem with the file. Imagine if it weren't an IOException and were just 
an Exception. Then my code would have to be even more generic and could 
not handle the exception as specifically.
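
As an illustration of the new behaviour, here is a sketch of what a caller 
could do if the parser threw a dedicated subclass of IOException. 
PdfParseException and parseHeader are hypothetical names invented for this 
example; PDFBox does not provide them as far as this thread shows. Because 
the new type extends IOException, existing catch blocks would keep working.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

class SpecificExceptionSketch {
    // Hypothetical exception type, invented for this example.
    static class PdfParseException extends IOException {
        PdfParseException(String message) {
            super(message);
        }
    }

    // Hypothetical parser entry point: a missing input is an I/O problem,
    // a bad header is a parse problem.
    static void parseHeader(String firstLine) throws IOException {
        if (firstLine == null) {
            throw new FileNotFoundException("no input");
        }
        if (!firstLine.startsWith("%PDF-")) {
            throw new PdfParseException("Error: Header doesn't contain versioninfo");
        }
    }

    // The caller can now react differently to the two kinds of failure.
    // The more specific catch block has to come first.
    static String classify(String firstLine) {
        try {
            parseHeader(firstLine);
            return "ok";
        } catch (PdfParseException e) {
            return "malformed PDF: " + e.getMessage();
        } catch (IOException e) {
            return "I/O problem: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("%PDF-1.4"));
        System.out.println(classify("<html>"));
        System.out.println(classify(null));
    }
}
```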

On 7/11/2014 2:08 AM, James Green wrote:
> This raises an interesting question, and one that applies to software in
> general. I actually think PDFBox has it right - something more specific
> might sound correct but to whom is it useful? Exceptions in my
> experience tend to bubble straight to the user (perhaps logged to file, and
> an "oops" given to the user). The user in this case needs to be told
> there's something wrong with the file, and the error itself says what.
>
> Does PDFParseException give your software some new behaviour?
>
>

Re: IOException should be something more specific?

Posted by James Green <ja...@gmail.com>.

This raises an interesting question, and one that applies to software in
general. I actually think PDFBox has it right - something more specific
might sound correct but to whom is it useful? Exceptions in my
experience tend to bubble straight to the user (perhaps logged to file, and
an "oops" given to the user). The user in this case needs to be told
there's something wrong with the file, and the error itself says what.

Does PDFParseException give your software some new behaviour?


