Posted to user@nutch.apache.org by al...@aim.com on 2015/04/01 00:04:06 UTC

parsing mime-type text/html with parse-tika

Hello,

I am trying to use the nutch-2.x trunk to parse text/html documents with parse-tika, but I get the error "parser for text/html not found".

I see that the parse-tika code was changed. These lines

    // get the right parser using the mime type as a clue
    String mimeType = page.getContentType().toString();
    CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
    Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));

return no parser.

However, if I revert to the older version with

    // get the right parser using the mime type as a clue
    String mimeType = page.getContentType().toString();
    Parser parser = tikaConfig.getParser(mimeType);

it works.
   
Has anyone tested the new parse-tika code with the text/html type?
   
Thanks.
   
Alex.