Posted to dev@tika.apache.org by "Qian Diao (JIRA)" <ji...@apache.org> on 2013/03/26 18:29:15 UTC
[jira] [Closed] (TIKA-1097) not able to parse pdfs/docs/ppts using 1.1 and 1.3 tika parser

     [ https://issues.apache.org/jira/browse/TIKA-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qian Diao closed TIKA-1097.
---------------------------

    Resolution: Invalid

I will split this bug into multiple issues.
                
> not able to parse pdfs/docs/ppts using 1.1 and 1.3 tika parser
> ----------------------------------------------------------------
>
>                 Key: TIKA-1097
>                 URL: https://issues.apache.org/jira/browse/TIKA-1097
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1, 1.3
>         Environment: linux redhat 
>            Reporter: Qian Diao
>             Fix For: 1.3, 1.1
>
>
> Hi,
> I ran into parsing problems with Tika 1.1: some PDF, DOC, and PPT files were not parsed.
> I then tried Tika 1.3, but some PDF/DOC/PPT files still cannot be parsed.
> My code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>         public String parseFile(File inFile) {
>             if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;
>                    
>             InputStream is = null;
>             String outputText = "";
>             try {
>                 // Open input stream
>                 is = new FileInputStream(inFile);
>                 // Prepare parser
>                 BodyContentHandler contenthandler = new BodyContentHandler(-1);
>                 Metadata metadata = new Metadata();
>                 metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>                 ParseContext pc = new ParseContext();
>                 // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
>                 if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                         Parser parser = new AutoDetectParser();
>                         parser.parse(is, contenthandler, metadata, pc);
>                 }
>                 else {
>                         Parser parser = new HtmlParser();
>                         BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                         parser.parse(is, bh, metadata, pc);
>                 }
>                 // Prepare text for write
>                 outputText = contenthandler.toString();        
>             } catch (Exception e) {
>                 System.out.println(e);
>                 return null;
>             } finally {
>                 try { 
>                     if (is != null) 
>                         is.close(); 
>                 } catch (Exception e) {}
>             }
>            
>             return outputText;
>         }
> }
> ======
> output:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@3a6ac461
> url_4080_ETS11_TAGMatrix_rev070111.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2b03be0
> url_2275_Paper26Pages253-269.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4f9a32e0
> url_5889_viz.96.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4e513d61
> url_1556_sensys_awoo03.pdf
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@426295eb
> url_5300_sudoku2.pdf?referrer=webcluster&
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7c2e1f1f
> url_1441_ChoosingYourFirstCSCourse2011-FINAL.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7eda18ac
> url_4272_20080218121324_723.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f0ffb38
> url_2491_2106_crime_scene.doc
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4cedf389
> url_5227_Romano-Library%20Research%20Series%20-%20March%2029%202007%20FINAL(small).ppt
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6126f827
> url_5250_linked%20list.ppt
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3749eb9f
> url_2011_undergrad-brochure.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3a289d2e
> url_5709_final_presentation_bak.ppt
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ddc0e7a
> url_5319_2011_2012_advising_guidelines.pdf
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@7dc5ddc9
> url_3502_TheEvolvingRoleTech.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4963f7a1
> url_2403_class_presentation_Btree.ppt
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7ba85d38
> url_4040_fukunaga_jair07_bin.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6a8046f4
> url_2472_COP3530OverheadsF99.doc
> Thanks,
> Qian
>  
>        
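
The output above comes from System.out.println(e), which prints only the outer TikaException message and not the exception it wraps. Since the plan is to split this report into several issues, one possible way to separate the failures is by innermost cause. The sketch below is an illustration only and uses just the JDK: the class RootCauses, its methods, and the sample exceptions in main are hypothetical stand-ins, not part of Tika and not taken from the failing files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: groups failures by their innermost cause.
class RootCauses {

    // Follows getCause() to the innermost exception.
    // Assumes the cause chain is acyclic, which is the normal case.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Builds a short grouping key: root exception class name plus its message.
    static String signature(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    public static void main(String[] args) {
        // Stand-ins for exceptions caught around parser.parse(...);
        // the causes here are invented for the demonstration.
        Exception[] failures = {
            new Exception("wrapper A", new IllegalStateException("sample cause 1")),
            new Exception("wrapper B", new IllegalStateException("sample cause 1")),
            new Exception("wrapper C", new java.io.IOException("sample cause 2"))
        };

        // Count how many failures share each root-cause signature.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Exception e : failures) {
            counts.merge(signature(e), 1, Integer::sum);
        }
        counts.forEach((key, n) -> System.out.println(n + " x " + key));
    }
}
```

In the reporter's catch block, calling e.printStackTrace() or printing signature(e) next to the file name would show which failures share a cause, assuming the wrapped exception is attached as the cause.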
