Posted to dev@tika.apache.org by Phani Kumar Samudrala <ph...@arisglobal.co.in> on 2013/02/12 11:22:47 UTC

Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using the Tika 1.2 Java API to extract text from a PDF and I am getting the following exception. The error occurs only for some PDF documents; others parse fine. I couldn't figure out the reason for this. With Tika 1.1 it works fine. Please let me know if any of you have seen this error and how to fix it.

Here is the exception:


org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
      at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
      at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 3 more
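
The "Caused by" line suggests that an entry expected to hold a dictionary held a string instead. The sketch below reproduces that kind of failure with hypothetical stand-in classes (CosBase, CosString, CosDictionary are illustrations, not PDFBox's real classes) and shows an instanceof guard next to the unchecked cast. The key "A" is also just an illustrative choice. This only demonstrates the failure mode under those assumptions; it is not PDFBox's actual code or its fix.

```java
import java.util.HashMap;
import java.util.Map;

public class CastGuardSketch {
    // Hypothetical stand-ins for a PDF object model; NOT PDFBox's real classes.
    static class CosBase {}

    static class CosString extends CosBase {
        final String value;
        CosString(String value) { this.value = value; }
    }

    static class CosDictionary extends CosBase {
        final Map<String, CosBase> items = new HashMap<>();
    }

    // Unchecked cast: throws ClassCastException when the entry is not a dictionary.
    static CosDictionary actionUnchecked(CosDictionary annotation) {
        return (CosDictionary) annotation.items.get("A");
    }

    // Guarded variant: returns null when the entry is missing or has another type.
    static CosDictionary actionGuarded(CosDictionary annotation) {
        CosBase entry = annotation.items.get("A");
        return (entry instanceof CosDictionary) ? (CosDictionary) entry : null;
    }

    public static void main(String[] args) {
        CosDictionary annotation = new CosDictionary();
        annotation.items.put("A", new CosString("not a dictionary"));

        try {
            actionUnchecked(annotation);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getClass().getSimpleName());
        }
        System.out.println("guarded result is null: " + (actionGuarded(annotation) == null));
    }
}
```

If the real cause is similar, it would explain why only some PDFs fail: only documents that store the entry in the unexpected form would hit the cast.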


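Here is the Java code snippet:


import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
File file = new File(fileString);
URL url = file.toURI().toURL();

ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
Metadata metadata = new Metadata();
context.set(Parser.class, parser); // so embedded documents are parsed too; used for PPT, Word, XLSX, PDF, HTML
ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);

input.close();
outputstream.close();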
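
Since the parse call reports the failure wrapped in another exception, a caller may want to log the innermost cause when deciding which library to report it to. A minimal stdlib-only sketch, assuming a hypothetical helper named rootCause; the wrapped exception in main is a plain java.lang.Exception standing in for the one thrown by parser.parse(...):

```java
public class RootCauseSketch {
    // Follows getCause() links until the innermost exception is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped failure; a real caller would catch
        // the exception thrown by the parse call instead.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from parser",
                new ClassCastException("string cannot be cast to dictionary"));

        Throwable root = rootCause(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

This does not work around the failure; it only makes the underlying exception easier to see in logs.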


Thanks

________________________________


Disclaimer: This transmission, including attachments, is confidential, proprietary, and may be privileged. It is intended solely for the intended recipient. If you are not the intended recipient, you have received this transmission in error and you are hereby advised that any review, disclosure, copying, distribution, or use of this transmission, or any of the information included therein, is unauthorized and strictly prohibited. If you have received this transmission in error, please immediately notify the sender by reply and permanently delete all copies of this transmission and its attachments.

RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Posted by Phani Kumar Samudrala <ph...@arisglobal.co.in>.

Sorry, I just realized that I posted this to the wrong mailing list. Please ignore it.


RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Posted by Phani Kumar Samudrala <ph...@arisglobal.co.in>.

When I open the file in Acrobat Reader, it says "You are viewing this document in PDF/A mode".

I am not familiar with PDF/A mode. Could it have anything to do with this issue?



RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Posted by Phani Kumar Samudrala <ph...@arisglobal.co.in>.

Hi,

I just tried Tika 1.3 (I see that it upgraded PDFBox to 1.7.1), but I am getting the same error.

In both cases, Tika 1.2 and Tika 1.3, when I replace tika-parsers.jar with the one from 1.0, it starts working fine.

I am not sure whether the problem lies in Tika or in PDFBox. Any ideas?


org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2a15cd
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at com.arisglobal.agcommon.agsolr.util.TikaIndexTest.main(TikaIndexTest.java:37)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
        at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
        at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:178)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:450)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:72)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 3 more
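
When swapping jars like this, it can help to confirm which jar each class in the trace is actually loaded from. A minimal sketch using only the JDK; JarLocatorSketch and locate are hypothetical names, and the two class names are taken from the stack trace above:

```java
import java.security.CodeSource;

public class JarLocatorSketch {
    // Returns where a class was loaded from, or a short note when that is unavailable.
    static String locate(String className) {
        try {
            // 'false' avoids running static initializers; we only need the location.
            Class<?> type = Class.forName(className, false, JarLocatorSketch.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? "no code source reported" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run with the same classpath as the failing program, this prints the jar path for each class, which shows whether the Tika and PDFBox versions in use are the ones intended.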


-Phani

RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Posted by Markus Jelsma <ma...@openindex.io>.

Hi

Can you try Tika 1.3? It upgraded PDFBox from 1.7.0 to 1.7.1 and that fixed many issues with PDF parsing.

Cheers,
 
 
