Posted to user@nutch.apache.org by Felix von Zadow <Fe...@mgm-tp.com> on 2016/09/30 12:04:25 UTC

Tika removes tags which I'd prefer to keep.

Hi!

I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML parsing. I searched the archive but couldn't find anything.

I would like to use parse-tika to parse HTML and then index the result into Solr. During parsing, Tika seems to remove quite a number of HTML tags and attributes. This does not really affect the text content that is later indexed, but it prevents me from using a parse filter to extract certain information based on the presence of particular div tags. For what it's worth, I control the set of pages being crawled.
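
To make this concrete, here is a rough sketch of the kind of check such a parse filter would run on the parsed DOM. It is only an illustration: the class name DivMarkerCheck, the method names, and the "teaser" CSS class are made up, and it uses only the JDK's DOM API rather than the real Nutch HtmlParseFilter interface. If the div elements or their class attributes have been dropped during parsing, a check like this can never match.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Tika.
public class DivMarkerCheck {

    // Parses a well-formed XHTML string into a DOM tree (stand-in for the
    // DOM that a parse filter would receive from the parser).
    static Node parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns true if the subtree rooted at node contains a <div> whose
    // class attribute includes cssClass as one of its space-separated tokens.
    static boolean hasDivWithClass(Node node, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "div".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String classes = ((Element) node).getAttribute("class");
            for (String c : classes.trim().split("\\s+")) {
                if (c.equals(cssClass)) {
                    return true;
                }
            }
        }
        // Depth-first walk over all children.
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (hasDivWithClass(child, cssClass)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // "teaser" is an invented marker class for this example.
        Node doc = parse(
            "<html><body><div class=\"teaser\"><p>text</p></div></body></html>");
        System.out.println(hasDivWithClass(doc, "teaser"));
    }
}
```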

So my question is: is there a configuration option (or some other way) to control how the Tika parser transforms the document?

Thanks a bunch!
Felix