Posted to user@tika.apache.org by Sznajder ForMailingList <bs...@gmail.com> on 2015/08/17 17:51:03 UTC

Extracting the structure of an HTML Document

Hi

I am a new user of Tika.

I am handling HTML documents. I have managed to parse them into
a "clean" text string.

However, I would also like to get the structure of the documents: what the
different sections are, what the titles of these sections are, etc.

Is there a way to do that with Tika?

Thanks!

Benjamin

RE: Extracting the structure of an HTML Document

Posted by Ken Krugler <kk...@transpac.com>.
Hi Benjamin,

It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get transformed) together with your own content handler, so that you get all of the tag start/end SAX events. Something like:

        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        // Pass every HTML element through unchanged instead of using the default mapping
        parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        new HtmlParser().parse(
                myInputStream,
                myContentHandler,
                metadata,
                parseContext);

Here myContentHandler is an instance of a custom class that extends org.xml.sax.helpers.DefaultHandler (similar to ToTextContentHandler in Tika). It will get called with all of the SAX events, in particular startElement(), endElement(), and characters().
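
As a rough illustration of such a handler, here is a minimal sketch that records only h1..h6 headings. The class name HeadingCollector and its getHeadings() method are hypothetical (they are not part of Tika), and treating h1..h6 as the section titles is an assumption about the documents; real pages may mark sections differently.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not part of Tika): collects h1..h6 headings as "hN: text". */
class HeadingCollector extends DefaultHandler {

    private final List<String> headings = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private int currentLevel = 0; // 0 means "not inside a heading"

    // Returns 1..6 for h1..h6 (either case) and 0 for any other element name.
    private static int headingLevel(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name != null && name.length() == 2
                && (name.charAt(0) == 'h' || name.charAt(0) == 'H')
                && name.charAt(1) >= '1' && name.charAt(1) <= '6') {
            return name.charAt(1) - '0';
        }
        return 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int level = headingLevel(localName, qName);
        if (level > 0) {
            // Entering a heading: remember its level and start a fresh text buffer.
            currentLevel = level;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate until the heading closes.
        if (currentLevel > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentLevel > 0 && headingLevel(localName, qName) == currentLevel) {
            headings.add("h" + currentLevel + ": " + text.toString().trim());
            currentLevel = 0;
        }
    }

    /** Headings in document order. */
    public List<String> getHeadings() {
        return headings;
    }
}
```

An instance of this class could be passed as myContentHandler in the parse() call, with getHeadings() read once parsing finishes. Tracking other elements (div, section, p) in the same way would give a fuller outline.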

-- Ken

