Posted to user@tika.apache.org by Sergiy Karpenko <se...@exoplatform.com> on 2010/07/19 15:27:06 UTC

Problem with Tika configuration

Hello, friends

I want to configure Tika to use only PDFParser.

So I created a tika-config.xml with exactly this content:

<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
  <mime>application/pdf</mime>
</parser>

And I have this test:

      File file = getResourceAsFile("/test-documents/testPDF.pdf");
      TikaConfig myTC = new TikaConfig(getResourceAsFile("/test-documents/tika-config.xml"));
      String s1 = ParseUtils.getStringContent(file, myTC);

It fails on the last line with:
java.lang.NullPointerException
    at org.apache.tika.utils.ParseUtils.getStringContent(ParseUtils.java:111)
    at org.apache.tika.utils.ParseUtils.getStringContent(ParseUtils.java:170)
    at org.apache.tika.utils.ParseUtils.getStringContent(ParseUtils.java:188)
    at org.apache.tika.TestParsers.testOwnPDFParser(TestParsers.java:60)

Debugging shows that tika-config.xml contains an incorrect configuration.

The following one works fine:
<blabla>
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
  <mime>application/pdf</mime>
</parser>
</blabla>
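
One possible explanation for the difference, offered as a sketch rather than a confirmed diagnosis: if TikaConfig collects <parser> elements with a descendant lookup on the document root (an assumption about this Tika version; nothing in this thread confirms it), then a file whose root element is <parser> itself yields no parsers, while any wrapper element makes the same <parser> visible. The class and method names below are hypothetical and use only the JDK's DOM API, not Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of Tika.
class ParserElementLookup {

    // Counts <parser> elements that are descendants of the document root.
    // Element.getElementsByTagName never returns the element it is called on.
    static int countParserDescendants(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        return root.getElementsByTagName("parser").getLength();
    }

    public static void main(String[] args) throws Exception {
        String parser =
            "<parser name=\"parse-pdf\" class=\"org.apache.tika.parser.pdf.PDFParser\">"
            + "<mime>application/pdf</mime></parser>";

        // <parser> is the root here, so the descendant lookup finds nothing.
        System.out.println(countParserDescendants(parser));
        // Any wrapper element turns <parser> into a descendant of the root.
        System.out.println(countParserDescendants("<blabla>" + parser + "</blabla>"));
    }
}
```

Under that assumption, the name of the wrapper element would not matter; only the fact that <parser> is nested below the root would.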

Is there any documentation about Tika configuration, or at least a link to
a correct, well-formed tika-config.xml?

Thanks

Re: Problem with Tika configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Jul 19, 2010 at 4:27 PM, Sergiy Karpenko
<se...@exoplatform.com> wrote:
> I want configure tika to use only PDFParser

The easiest way to achieve this is to directly use the PDFParser class
instead of working through the configuration.

>       File file = getResourceAsFile("/test-documents/testPDF.pdf");
>       TikaConfig myTC = new TikaConfig(getResourceAsFile("/test-documents/tika-config.xml"));
>       String s1 = ParseUtils.getStringContent(file, myTC);

Use something like this instead:

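    import java.io.File;
    import java.io.InputStream;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // Instantiate the PDF parser directly; no TikaConfig is involved.
    Parser parser = new PDFParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    InputStream stream = TikaInputStream.get(new File("document.pdf"));
    try {
        parser.parse(stream, handler, metadata, context);
    } finally {
        stream.close();
    }

    // The handler accumulates the extracted body text.
    String content = handler.toString();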

BR,

Jukka Zitting