Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2013/03/01 13:35:20 UTC

IdentityHtmlMapper not used by Boilerpipe?

Hi,

We need div elements in the output when we pass a stream through Boilerpipe from Nutch. We enable includeMarkup so that markup is returned at all, but the divs are still missing. In the ParseContext we call context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE), but this setting is not honored.
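
To illustrate the kind of lookup involved, here is a minimal stdlib-only sketch. It assumes the parser resolves its mapper through a type-keyed context with a default fallback; SketchMapper, SketchContext and MapperLookupSketch are hypothetical stand-ins, not the real Tika classes, and the toy default mapper keeps only p. If the context that reaches the parser does not carry the mapper we set, the fallback applies silently, which would look like the behaviour described above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an HTML mapper; NOT Tika's HtmlMapper interface.
interface SketchMapper {
    // Returns the output element name, or null if the element is dropped.
    String mapSafeElement(String name);
}

// Hypothetical stand-in for a type-keyed parse context; NOT Tika's ParseContext.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Falls back to defaultValue when nothing was registered for the key.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

class MapperLookupSketch {
    // Passes every element through, lower-cased (assumed identity behaviour).
    static final SketchMapper IDENTITY = name -> name.toLowerCase(Locale.ENGLISH);

    // Toy default that keeps only <p>; the real safe-element list is much longer.
    static final SketchMapper DEFAULT = name -> "p".equalsIgnoreCase(name) ? "p" : null;

    // Resolves the mapper the way a parser might, then maps one element name.
    static String resolve(SketchContext context, String element) {
        return context.get(SketchMapper.class, DEFAULT).mapSafeElement(element);
    }

    public static void main(String[] args) {
        SketchContext configured = new SketchContext();
        configured.set(SketchMapper.class, IDENTITY);

        // The configured context keeps the div.
        System.out.println(resolve(configured, "DIV"));
        // A fresh context falls back to the default mapper and drops it.
        System.out.println(resolve(new SketchContext(), "DIV"));
    }
}
```

The second call models what would happen if some layer between our code and the HTML parser built its own context instead of forwarding ours.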

DefaultHtmlMapper appears to be used behind the scenes instead. We know this because divs are returned if we add DIV,div to the SAFE_ELEMENTS map. That is not a good fix: we prefer not to modify this parser class, and the unit test testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) fails once div is added to DefaultHtmlMapper.SAFE_ELEMENTS.
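
For comparison, a mapper can in principle be extended without editing a shared element table, by wrapping the existing mapper and letting a few extra elements through. The sketch below is stdlib-only; ElementMapper, ExtraElementsMapper and DelegatingMapperSketch are hypothetical names, and the toy base mapper keeps only p. It would only help if the mapper supplied through the context were actually consulted, which is exactly what is in doubt here.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an HTML element mapper; NOT Tika's HtmlMapper interface.
interface ElementMapper {
    // Returns the output element name, or null if the element is dropped.
    String mapSafeElement(String name);
}

// Wraps a base mapper and additionally lets through a fixed set of extra elements.
final class ExtraElementsMapper implements ElementMapper {
    private final ElementMapper base;
    private final Set<String> extras; // expected to hold lower-case element names

    ExtraElementsMapper(ElementMapper base, Set<String> extras) {
        this.base = base;
        this.extras = extras;
    }

    @Override
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ENGLISH);
        // Extra elements are kept; everything else is decided by the base mapper.
        return extras.contains(lower) ? lower : base.mapSafeElement(name);
    }
}

class DelegatingMapperSketch {
    // Toy base that keeps only <p>; it is never modified by the wrapper.
    static final ElementMapper BASE = name -> "p".equalsIgnoreCase(name) ? "p" : null;

    public static void main(String[] args) {
        ElementMapper withDiv = new ExtraElementsMapper(BASE, Collections.singleton("div"));
        System.out.println(withDiv.mapSafeElement("DIV"));  // kept by the wrapper
        System.out.println(withDiv.mapSafeElement("span")); // still dropped by the base
        System.out.println(BASE.mapSafeElement("div"));     // base behaviour unchanged
    }
}
```

Because the base mapper stays untouched, code and tests that rely on its original behaviour would not see the extra div.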

Any ideas on how we can force IdentityHtmlMapper to be used instead?

Thanks,
Markus

Re: IdentityHtmlMapper not used by Boilerpipe?

Posted by Dan Klueter <da...@gmail.com>.
unsubscribe




-- 
Dan Klueter