Posted to user@tika.apache.org by Sergiy Karpenko <se...@exoplatform.com> on 2010/07/19 15:27:06 UTC
Problem with Tika configuration
Hello, friends,
I want to configure Tika to use only PDFParser,
so I made a tika-config.xml with exactly this content:
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
  <mime>application/pdf</mime>
</parser>
And I have this test:
File file = getResourceAsFile("/test-documents/testPDF.pdf");
TikaConfig myTC = new TikaConfig(getResourceAsFile("/test-documents/tika-config.xml"));
String s1 = ParseUtils.getStringContent(file, myTC);
It fails on the last line with:
java.lang.NullPointerException
    at org.apache.tika.utils.ParseUtils.getStringContent(ParseUtils.java:111)
    at org.apache.tika.utils.ParseUtils.getStringContent(ParseUtils.java:170)
    at org.apache.tika.utils.ParseUtils.getStringContent(ParseUtils.java:188)
    at org.apache.tika.TestParsers.testOwnPDFParser(TestParsers.java:60)
Debugging shows that tika-config.xml contains an incorrect configuration.
The following one works fine:
<blabla>
  <parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
    <mime>application/pdf</mime>
  </parser>
</blabla>
Is there any documentation about Tika configuration, or at least a link to
a correct, well-formed tika-config.xml?
Thanks
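
One plausible explanation for the difference between the two files above, sketched with the JDK's own DOM API only: if the configuration loader looks up <parser> elements as descendants of the root element, then a file whose root is itself <parser> yields no parsers at all, while any wrapper element (even <blabla>) makes the parser visible. Whether TikaConfig does exactly this is an assumption here, not something confirmed in this thread; the class and method names below are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; it does not use or reproduce any Tika code.
class RootElementLookupDemo {

    /** Counts the <parser> elements found below the document's root element. */
    public static int countParserElements(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = document.getDocumentElement();
        // Element.getElementsByTagName returns descendants only, never the element itself.
        return root.getElementsByTagName("parser").getLength();
    }

    public static void main(String[] args) throws Exception {
        String parser =
                "<parser name=\"parse-pdf\" class=\"org.apache.tika.parser.pdf.PDFParser\">"
                + "<mime>application/pdf</mime></parser>";

        // <parser> as the root element: the lookup finds nothing.
        System.out.println(countParserElements(parser));                          // prints 0
        // <parser> wrapped in an arbitrary root element: the lookup finds it.
        System.out.println(countParserElements("<blabla>" + parser + "</blabla>")); // prints 1
    }
}
```

If that assumption holds, the name of the wrapper element would not matter to the lookup, only that <parser> is not the root itself.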
Re: Problem with Tika configuration
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Mon, Jul 19, 2010 at 4:27 PM, Sergiy Karpenko
<se...@exoplatform.com> wrote:
> I want to configure Tika to use only PDFParser
The easiest way to achieve this is to directly use the PDFParser class
instead of working through the configuration.
> File file = getResourceAsFile("/test-documents/testPDF.pdf");
> TikaConfig myTC = new TikaConfig(getResourceAsFile("/test-documents/tika-config.xml"));
> String s1 = ParseUtils.getStringContent(file, myTC);
Use something like this instead:
Parser parser = new PDFParser();
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
InputStream stream = TikaInputStream.get(new File("document.pdf"));
try {
    parser.parse(stream, handler, metadata, context);
} finally {
    stream.close();
}
String content = handler.toString();
BR,
Jukka Zitting