Posted to user@tika.apache.org by Mukhit <mu...@gmail.com> on 2012/01/05 10:29:52 UTC

Extract span tag

Hello!
I am trying to extract span tags from an HTML document, but without success. Tika's HTML parser
extracts only tags like p, a, b, br, div.
Any suggestions would be nice.
Thanks.


Re: Extract span tag

Posted by Mukhit <mu...@gmail.com>.
Dear Jukka,

Many thanks for your response! It was my mistake; everything works now.

BR,

Mukhit



Re: Extract span tag

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jan 5, 2012 at 10:29 AM, Mukhit <mu...@gmail.com> wrote:
> I am trying to extract span tags from an HTML document, but without success. Tika's HTML parser
> extracts only tags like p, a, b, br, div.
> Any suggestions would be nice.

By default Tika attempts to normalize the incoming HTML document to
make it easier for client applications to consume. See the
org.apache.tika.parser.html.DefaultHtmlMapper class for the details.

You can either subclass DefaultHtmlMapper so that span tags are also
marked as OK to include in the XHTML output, or use the
org.apache.tika.parser.html.IdentityHtmlMapper class to disable all
normalization of the incoming HTML. Either solution will allow span
tags to be passed through to your application.
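
As a rough sketch of the first option, a subclass could look something like
the one below. The class name SpanAwareHtmlMapper is made up for this
example, and the case-insensitive comparison is only a precaution, since the
case in which element names arrive may depend on the Tika version.

```java
import org.apache.tika.parser.html.DefaultHtmlMapper;

// Hypothetical subclass: keeps Tika's default normalization but also
// treats span as an element that is safe to pass through.
public class SpanAwareHtmlMapper extends DefaultHtmlMapper {

    @Override
    public String mapSafeElement(String name) {
        // Compare case-insensitively; this is a defensive assumption
        // about how the element name is passed in.
        if ("span".equalsIgnoreCase(name)) {
            return "span";
        }
        // Everything else falls back to the default mapping.
        return super.mapSafeElement(name);
    }
}
```

Attributes on the span elements may still be filtered out by the default
mapper, so if you need them, overriding mapSafeAttribute in the same way is
probably necessary as well.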

To make Tika use such an alternative HtmlMapper instance, simply pass
it as a part of the parse context, like this:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

    Parser parser = ...;
    parser.parse(..., context);
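
For completeness, here is one way the pieces might fit together end to end.
This is only a sketch: the class name SpanExtractionDemo, the method toXhtml
and the sample HTML are invented for illustration, and it assumes the Tika
HTML parser and its dependencies are on the classpath. HtmlParser and
ToXMLContentHandler are used here just as one convenient parser and handler
combination; any other Parser or ContentHandler should work the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of Tika.
public class SpanExtractionDemo {

    // Parses the given HTML with normalization disabled and returns
    // the XHTML produced by Tika as a string.
    public static String toXhtml(String html) {
        // Register the alternative mapper in the parse context.
        ParseContext context = new ParseContext();
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Collects the SAX events emitted by the parser as XML text.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);

        try (InputStream stream = new ByteArrayInputStream(bytes)) {
            new HtmlParser().parse(stream, handler, new Metadata(), context);
        } catch (IOException | SAXException | TikaException e) {
            throw new IllegalStateException("HTML parsing failed", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <span class=\"x\">world</span></p></body></html>";
        // With IdentityHtmlMapper the span element should survive in the output.
        System.out.println(toXhtml(html));
    }
}
```

To try the subclassing approach instead, pass an instance of your own
DefaultHtmlMapper subclass to context.set(...) in place of
IdentityHtmlMapper.INSTANCE.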

BR,

Jukka Zitting