Posted to user@nutch.apache.org by al...@aim.com on 2015/04/01 00:04:06 UTC
parsing mime-type text/html with parse-tika
Hello,
I am trying to use the Nutch 2.x trunk to parse text/html content with Tika.
I get the error "parser for text/html not found".
I see that the parse-tika code was changed. These lines
// get the right parser using the mime type as a clue
String mimeType = page.getContentType().toString();
CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));
return no parser.
However, if I revert to the older version with
// get the right parser using the mime type as a clue
String mimeType = page.getContentType().toString();
Parser parser = tikaConfig.getParser(mimeType);
it works.
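
One possible explanation (an assumption, not something confirmed here) is that the new code does an exact-key lookup in a map, so a content type string that differs from the registered key in any way, for example by carrying a parameter such as "; charset=UTF-8", finds nothing. The sketch below models that difference in plain Java only. It does not use Tika; the PARSERS registry, the parser names, and both lookup methods are hypothetical stand-ins for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MimeLookupSketch {

    // Hypothetical registry: base media type -> parser name.
    // A stand-in for a map keyed by media type; not Tika's actual data.
    static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/html", "HtmlParser");
        PARSERS.put("application/pdf", "PDFParser");
    }

    // Exact lookup: succeeds only if the string equals a registered key.
    static String exactLookup(String contentType) {
        return PARSERS.get(contentType);
    }

    // Tolerant lookup: drop any parameters after ';', trim, lower-case,
    // and only then consult the registry.
    static String normalizedLookup(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return PARSERS.get(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String contentType = "text/html; charset=UTF-8";
        System.out.println("exact:      " + exactLookup(contentType));      // null
        System.out.println("normalized: " + normalizedLookup(contentType)); // HtmlParser
    }
}
```

If this is the cause, logging the exact value of page.getContentType() next to the key set of compositeParser.getParsers() should show the mismatch.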
Has anyone tested the new Tika code with text/html types?
Thanks.
Alex.