Hello list,

This question is about how to get the content of
<div id="article">..interesting content...</div>

Is the <div> element skipped on purpose, or is there a way to tell the
parser which elements to pass through and which not?

I am using Tika to extract plain text from documents behind RSS feeds.
Many of these documents are HTML, and most of the HTML pages are based
on templates. The template content is repeated on every page and does
not contain useful information. I am only interested in the added
content, not in the templates themselves. I found that almost all such
pages mark the start and end of the interesting part, something like
this:

<div id="article">....</div> or <div class="news">....</div> etc.

I wrote an extended ContentHandler to filter these marked parts from
the HTML. I figured that if I overrode the methods
DefaultHandler.startElement() and DefaultHandler.endElement(), I would
be able to extract the contents of these <div> elements. But I was
wrong: with my sample HTML files, the Tika parser only seems to pass
through these elements to the ContentHandler:

<html><head><title><body><p><a><ol><li><ul><table><tr><td><tbody>

ContentHandler.startElement() is never called for the <div> element.

I am using the Tika parser like this:

<code>
URL itemURL = new URL(itemLink);
DataInputStream daHTMLfromDaItem =
    new DataInputStream(itemURL.openStream());

ContentHandler bodyContentHandler =
    new MyExtendedBodyContentHandler(htmlTag, htmlTagAttribute,
                                     htmlTagAttributeValue);
Metadata metadata = new Metadata();
AutoDetectParser p = new AutoDetectParser();
try {
    // get HTML and convert to text in bodyContentHandler
    p.parse(daHTMLfromDaItem, bodyContentHandler, metadata);
    ...
</code>

Is there a way to tell the parser to call the handler's startElement()
and endElement() methods for elements like <div>? Or should I use
another method to get the content of these <div> elements?

Thanks,

Anne Blankert
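For reference, the kind of filtering handler described above could look roughly like the sketch below. It is not the poster's actual MyExtendedBodyContentHandler: the class name MarkedElementHandler and its fields are hypothetical, and it only collects anything if the parser really delivers startElement()/endElement() events for the marked element, which is exactly what this thread is about. The main() method uses the JDK's own SAX parser on a small well-formed snippet so the sketch can be run without Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text inside <tag attr="value">...</tag>. */
class MarkedElementHandler extends DefaultHandler {
    private final String tag;
    private final String attr;
    private final String value;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside the marked element

    MarkedElementHandler(String tag, String attr, String value) {
        this.tag = tag;
        this.attr = attr;
        this.value = value;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Parsers that are not namespace-aware may leave localName empty.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (depth > 0) {
            depth++; // nested element inside the marked region
        } else if (tag.equalsIgnoreCase(name) && value.equals(atts.getValue(attr))) {
            depth = 1; // entering the marked element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--; // reaches 0 again when the marked element closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div id=\"menu\">Home</div>"
                + "<div id=\"article\"><p>Interesting content</p></div>"
                + "</body></html>";
        MarkedElementHandler handler =
                new MarkedElementHandler("div", "id", "article");
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler.getText());
    }
}
```

The depth counter is there so that elements nested inside the marked <div> (for example a <p> or another <div>) do not end the capture early.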
Hi,

On Tue, Dec 15, 2009 at 10:31 PM, Anne Blankert <an...@geodan.nl> wrote:
> The following solves the quiet omission of <div> elements by the tika html
> parser:
>
> changed file apache/tika/parser/html/HtmlParser.java
> method
>    protected String mapSafeElement(String name)
> added line
>    if ("DIV".equals(name)) return "div";
>
> Could this change be applied to the tika source?

I'm not too excited about this change, as it would be good to keep the
Tika output as simple as possible by default. The <div> elements carry
no inherent semantic meaning, so for a generic client (i.e. one without
domain-specific knowledge) they'd just be an unnecessary distraction.

However, I can see how a client that does have better knowledge about
the expected document structure might want to have such information
passed through by Tika. See TIKA-347 for the very latest recommended
solution to this.

> Subclassing HtmlParser does not seem to be an easy alternative solution,
> because it requires changing the default TikaConfig.

See the TIKA-347 changes that I've just committed to the Tika trunk and
that will be included in the upcoming Tika 0.6 release. With these
changes it's possible to pass customized HTML mapping rules through the
parse context mechanism that was introduced in Tika 0.5.

For example, you could do this:

    class MyHtmlMapper extends DefaultHtmlMapper {
        public String mapSafeElement(String name) {
            if ("DIV".equals(name)) return "div";
            return super.mapSafeElement(name);
        }
    }

    Parser parser = ...;
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new MyHtmlMapper());
    parser.parse(..., context);

BR,

Jukka Zitting
The following solves the quiet omission of <div> elements by the Tika
HTML parser:

changed file apache/tika/parser/html/HtmlParser.java
method
    protected String mapSafeElement(String name)
added line
    if ("DIV".equals(name)) return "div";

Could this change be applied to the Tika source?

Subclassing HtmlParser does not seem to be an easy alternative
solution, because it requires changing the default TikaConfig.

On 2009-12-03 18:42, Anne Blankert wrote:
> Hello list,
>
> This question is about how to get the content of <div
> id="article">..interesting content...</div>
>
> Is the <div> element skipped on purpose or is there a way to tell the
> parser what to pass through and what not?
> [...]

--
Drs. Anne Blankert
Geodan Systems & Research