Posted to user@tika.apache.org by Dmitry Minkovsky <dm...@gmail.com> on 2015/03/10 02:59:58 UTC
Facade uses the EmptyParser despite correct type detection
I am trying to use the Tika facade. Here's my test code:
Tika tika = new Tika();
Metadata md = new Metadata();

try {
    String content = tika.parseToString(src, md, 100000);

    System.out.println("Content length: " + content.length());

    for (String s : md.names()) {
        System.out.println(s + ": " + md.get(s));
    }
} catch (TikaException e) {
    System.out.println(e);
}
Here's the output:
> Content length: 0
> X-Parsed-By: org.apache.tika.parser.EmptyParser
> Content-Type: text/html
So:
* If Tika correctly identifies the input as text/html, why does it use the
EmptyParser?
* If I'm supposed to pass a parser, which parser should I pass for best
results, assuming that autodetection succeeds, as it seems to above?
Thank you,
Dmitry
Re: Facade uses the EmptyParser despite correct type detection
Posted by Dmitry Minkovsky <dm...@gmail.com>.
Pardon the interruption: I did not have tika-parsers on my classpath!
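For anyone who hits the same symptom, here is a minimal sketch of a classpath check that could be run before calling the facade. It assumes the HTML parser class is org.apache.tika.parser.html.HtmlParser, shipped in the tika-parsers artifact in the 1.x releases; the class name TikaClasspathCheck and the helper isOnClasspath are hypothetical names invented for this illustration and are not part of Tika.

```java
public class TikaClasspathCheck {

    // Hypothetical helper (not a Tika API): returns true if the named class
    // can be located on the current classpath, without initializing it.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: in Tika 1.x this class lives in tika-parsers, not tika-core.
        String htmlParser = "org.apache.tika.parser.html.HtmlParser";

        if (isOnClasspath(htmlParser)) {
            System.out.println("HTML parser found on the classpath.");
        } else {
            System.out.println("HTML parser not found; is tika-parsers on the classpath?");
        }
    }
}
```

With only tika-core present, type detection can still report text/html, but no concrete parser is available to handle it, which is consistent with the X-Parsed-By: org.apache.tika.parser.EmptyParser line and the zero-length content above.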
Thank you,
Dmitry