Posted to user@tika.apache.org by Anne Blankert <an...@geodan.nl> on 2009/12/03 18:42:43 UTC

How to customize parsing html, retrieve content?

Hello list,

This question is about how to get the content of <div 
id="article">..interesting content...</div>

Is the <div> element skipped on purpose or is there a way to tell the 
parser what to pass through and what not?

I am using Tika to extract plain text from documents linked from RSS 
feeds. Many of these documents are HTML, and most of the HTML pages are 
based on templates. The template content is repeated on every page and 
contains no useful information. I am only interested in the added 
content, not the templates themselves. I found that almost all such HTML 
pages mark the start and end of the interesting part, something like this:

<div id="article">....</div> or <div class="news">....</div> etc.

I wrote an extended ContentHandler to filter these marked parts from the 
HTML. I figured that if I overrode the methods 
"DefaultHandler.startElement()" and "DefaultHandler.endElement()", I 
would be able to extract the contents of these <div> elements. But I was 
wrong: for my sample HTML files, the Tika parser only seems to pass 
through these elements: 
<html><head><title><body><p><a><ol><li><ul><table><tr><td><tbody> to the 
ContentHandler. ContentHandler.startElement() is not called for the <div> 
element.
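
To make the idea concrete, here is a minimal sketch of such a filtering handler. It uses only the JDK's SAX classes rather than Tika, so it handles well-formed XHTML only. The class name DivTextExtractor, the helper extract() and the sample markup are hypothetical, made up for illustration. The handler tracks nesting depth so that text inside child elements of the wanted <div> is kept as well:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data found inside any <div> whose id matches (hypothetical helper). */
class DivTextExtractor extends DefaultHandler {
    private final String wantedId;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside the wanted <div>

    DivTextExtractor(String wantedId) {
        this.wantedId = wantedId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside the wanted <div>
        } else if ("div".equals(qName) && wantedId.equals(atts.getValue("id"))) {
            depth = 1; // entering the wanted <div>
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--; // reaches 0 again when the wanted <div> closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString().trim();
    }

    /** Parses well-formed XHTML with the JDK SAX parser and returns the text of the matching <div>. */
    static String extract(String xhtml, String id) throws Exception {
        DivTextExtractor handler = new DivTextExtractor(id);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"menu\">template</div>"
                + "<div id=\"article\">interesting <b>content</b></div></body></html>";
        System.out.println(extract(page, "article"));
    }
}
```

A handler like this can only work with Tika if the HTML parser actually forwards the <div> events to it, which is the problem described here.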

I am using the tika parser like this:

<code>
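import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.xml.sax.ContentHandler;

URL itemURL = new URL(itemLink);
// a plain InputStream is enough; no DataInputStream wrapper is needed
InputStream daHTMLfromDaItem = itemURL.openStream();

// custom handler that should keep only the marked <div>
ContentHandler bodyContentHandler = new MyExtendedBodyContentHandler(
    htmlTag, htmlTagAttribute, htmlTagAttributeValue);
Metadata metadata = new Metadata();
AutoDetectParser p = new AutoDetectParser();

try {
  // get HTML and convert to text in bodyContentHandler
  p.parse(daHTMLfromDaItem, bodyContentHandler, metadata);
  ...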
</code>

Is there a way to tell the parser to call the handler's startElement() 
and endElement() methods for elements like <div>? Or should I use 
another method to get the content of these <div> elements?

Thanks,

Anne Blankert


Re: How to customize parsing html, retrieve content?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Dec 15, 2009 at 10:31 PM, Anne Blankert <an...@geodan.nl> wrote:
> The following solves the quiet omission of <div> elements by the tika html
> parser:
>
> changed file apache/tika/parser/html/HtmlParser.java
> method
>   protected String mapSafeElement(String name)
> added line
>   if ("DIV".equals(name)) return "div";
>
> Could this change be applied to the tika source?

I'm not too excited about this change as it would be good to keep the
Tika output as simple as possible by default. The <div> elements
contain no inherent semantic meaning, so for a generic client (i.e.
one without domain-specific knowledge) they'd just be an unnecessary
distraction.

However, I can see how a client that does have better knowledge about
the expected document structure might want to have such information
passed through by Tika. See TIKA-347 for the very latest recommended
solution to this.

> Subclassing HtmlParser does not seem to be an easy alternative solution,
> because it requires changing the default TikaConfig.

See the TIKA-347 changes that I've just committed to the Tika trunk
and that will be included in the upcoming Tika 0.6 release. With these
changes it's possible to pass customized HTML mapping rules through
the parse context mechanism that was introduced in Tika 0.5. For
example, you could do this:

    class MyHtmlMapper extends DefaultHtmlMapper {
        public String mapSafeElement(String name) {
            if ("DIV".equals(name)) return "div";
            return super.mapSafeElement(name);
        }
    }

    Parser parser = ...;
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new MyHtmlMapper());
    parser.parse(..., context);

BR,

Jukka Zitting

Re: How to customize parsing html, retrieve content?

Posted by Anne Blankert <an...@geodan.nl>.
The following change stops the Tika HTML parser from silently dropping 
<div> elements:

changed file apache/tika/parser/html/HtmlParser.java
method
    protected String mapSafeElement(String name)
added line
    if ("DIV".equals(name)) return "div";

Could this change be applied to the Tika source?

Subclassing HtmlParser does not seem to be an easy alternative solution, 
because it requires changing the default TikaConfig.

