Posted to dev@tika.apache.org by "David Pilato (JIRA)" <ji...@apache.org> on 2016/12/16 08:16:58 UTC

[jira] [Comment Edited] (TIKA-2208) Catch missing libraries

    [ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753790#comment-15753790 ] 

David Pilato edited comment on TIKA-2208 at 12/16/16 8:16 AM:
--------------------------------------------------------------

So I tried it this way.

Basically, I declared `<service-loader loadErrorHandler="IGNORE"/>`.

But when I looked at what actually happens, this setting is only used when the Tika instance is built.
When you then call a method such as `parseToString`, the service loader is not involved.

The problem occurs when Tika parses a Word document that contains a Visio schema, so at parsing time, not at initialization time.

The cause of the issue is clearly on our side: we removed a required library (`com.github.virtuald:curvesapi:1.04`).

{code}
_transitive_org.apache.poi:poi-ooxml:3.15
\--- org.apache.poi:poi-ooxml:3.15
     +--- org.apache.poi:poi:3.15
     |    +--- commons-codec:commons-codec:1.10
     |    \--- org.apache.commons:commons-collections4:4.1
     +--- org.apache.poi:poi-ooxml-schemas:3.15
     |    \--- org.apache.xmlbeans:xmlbeans:2.6.0
     |         \--- stax:stax-api:1.0.1
     \--- com.github.virtuald:curvesapi:1.04
{code}

But would it be possible for Tika to catch this kind of end-user error and throw a friendlier exception?
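
To make the request concrete, here is a minimal sketch of the kind of wrapping I have in mind. It does not use any Tika API: `SafeExtraction`, `extract` and `MissingDependencyException` are hypothetical names made up for this example, and the checked exception only stands in for something like `TikaException`.

```java
import java.util.concurrent.Callable;

public class SafeExtraction {

    /** Hypothetical checked exception, standing in for something like TikaException. */
    public static class MissingDependencyException extends Exception {
        public MissingDependencyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the given extraction and converts a NoClassDefFoundError
     * (an Error, not an Exception) into a checked exception that callers
     * can handle in a normal catch block.
     */
    public static String extract(Callable<String> extraction) throws Exception {
        try {
            return extraction.call();
        } catch (NoClassDefFoundError e) {
            // The original error is kept as the cause so the missing class stays visible.
            throw new MissingDependencyException(
                "A class needed by a parser is missing from the classpath: " + e.getMessage(), e);
        }
    }
}
```

On our side this could be called as `SafeExtraction.extract(() -> tika.parseToString(stream))`, where `tika` and `stream` are whatever instance and input the caller already has. Doing the equivalent inside Tika would spare every user from writing such a wrapper.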



> Catch missing libraries
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract text and metadata.
> We defined our list of Parsers:
> {code:java}
>     private static final Parser PARSERS[] = new Parser[] {
>         // documents
>         new org.apache.tika.parser.html.HtmlParser(),
>         new org.apache.tika.parser.rtf.RTFParser(),
>         new org.apache.tika.parser.pdf.PDFParser(),
>         new org.apache.tika.parser.txt.TXTParser(),
>         new org.apache.tika.parser.microsoft.OfficeParser(),
>         new org.apache.tika.parser.microsoft.OldExcelParser(),
>         new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>         new org.apache.tika.parser.odf.OpenDocumentParser(),
>         new org.apache.tika.parser.iwork.IWorkPackageParser(),
>         new org.apache.tika.parser.xml.DcXMLParser(),
>         new org.apache.tika.parser.epub.EpubParser(),
>     };
>     private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
>     private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds another, unsupported document (such as a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch such a case and throw a {{TikaException}} instead, so that it behaves as an Exception and not as a Throwable?


