Posted to user@tika.apache.org by Sznajder ForMailingList <bs...@gmail.com> on 2015/08/17 17:51:03 UTC
Extracting the structure of an HTML Document
Hi
I am a new user of Tika.
I am handling HTML documents, and I have succeeded in parsing them into
a "clean" text string.
However, I am interested in getting the structure of the documents: what are
the different sections, what are the titles of these sections, etc.
Is there a way to do that with Tika?
Thanks!
Benjamin
RE: Extracting the structure of an HTML Document
Posted by Ken Krugler <kk...@transpac.com>.
Hi Benjamin,
It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get transformed), and your own content handler, so that you get all of the tag start/end SAX events. So something like...
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
new HtmlParser().parse(
    myInputStream,
    myContentHandler,
    metadata,
    parseContext);
Here myContentHandler is an instance of a custom class that extends org.xml.sax.helpers.DefaultHandler (similar to ToTextContentHandler in Tika). It will be called with all of the SAX events, in particular startElement(), endElement(), and characters().
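As a rough illustration of what such a handler could look like, here is a minimal sketch that only records h1..h6 headings and their text. HeadingCollector and its nested Heading class are hypothetical names made up for this example, not part of Tika. The sketch assumes headings are not nested inside each other and that element names arrive in localName (falling back to qName when localName is empty).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not a Tika class): records each h1..h6 element and its text. */
class HeadingCollector extends DefaultHandler {

    /** One heading found in the document. */
    static final class Heading {
        final int level;    // 1 for h1 ... 6 for h6
        final String text;  // heading text with whitespace collapsed

        Heading(int level, String text) {
            this.level = level;
            this.text = text;
        }
    }

    private final List<Heading> headings = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private int currentLevel = 0; // 0 means "not inside a heading"

    /** Returns 1..6 for h1..h6 and 0 for any other element name. */
    private static int headingLevel(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name == null) {
            return 0;
        }
        name = name.toLowerCase(Locale.ROOT);
        if (name.length() == 2 && name.charAt(0) == 'h') {
            char digit = name.charAt(1);
            if (digit >= '1' && digit <= '6') {
                return digit - '0';
            }
        }
        return 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int level = headingLevel(localName, qName);
        if (level > 0 && currentLevel == 0) {
            // Entering a heading: start collecting its text.
            currentLevel = level;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (currentLevel > 0) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentLevel > 0 && headingLevel(localName, qName) == currentLevel) {
            String text = buffer.toString().trim().replaceAll("\\s+", " ");
            headings.add(new Heading(currentLevel, text));
            currentLevel = 0;
        }
    }

    /** Headings in document order. */
    List<Heading> getHeadings() {
        return headings;
    }
}
```

An instance of this class would be passed as myContentHandler in the parse() call above, and getHeadings() read once parsing finishes. Grouping the text between two headings into "sections" would need more state in characters(); that part is left out here.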
-- Ken