Posted to dev@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/15 06:54:21 UTC

Tika HTML parsing

For what it's worth, I just committed some patches to Tika that should  
improve Tika's ability to extract HTML outlinks (in <img> and <frame>  
elements, at least). Support for <iframe> should be coming soon :)

This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm  
tracking down, but I think Tika is getting closer to being usable by  
Nutch for typical web crawling.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
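
For readers who want a concrete picture of what "extracting HTML outlinks" from <img>, <frame>, <iframe> and <area> elements involves, here is a minimal sketch. It is not Tika's or Nutch's code: it uses Python's standard html.parser module as a stand-in, and the names LINK_ATTRS, OutlinkCollector and extract_outlinks, as well as the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical table: which attribute carries the link target for each element.
LINK_ATTRS = {
    "a": "href",
    "area": "href",
    "img": "src",
    "frame": "src",
    "iframe": "src",
}


class OutlinkCollector(HTMLParser):
    """Collects (tag, absolute URL) pairs from link-bearing elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        attr_name = LINK_ATTRS.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            if name == attr_name and value:
                # Resolve relative references against the page URL.
                self.outlinks.append((tag, urljoin(self.base_url, value)))
                break


def extract_outlinks(html, base_url):
    """Return outlinks in document order."""
    collector = OutlinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.outlinks


SAMPLE = """
<html><frameset>
<frame src="menu.html">
<frame src="/content/main.html">
</frameset>
<body><a href="about.html">About</a><img src="logo.png">
<iframe src="http://other.example/widget"></iframe>
<map><area href="region1.html"></map></body></html>
"""

if __name__ == "__main__":
    for tag, url in extract_outlinks(SAMPLE, "http://example.com/dir/index.html"):
        print(tag, url)
```

The sample deliberately mixes <frameset> and <body> in one page. This sketch does not repair such markup; it only reports links in the order the start tags appear.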

Re: Tika HTML parsing

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-08-15 20:01, Ken Krugler wrote:

>> * does this include image maps as well (<area>)?
>
> I've got a patch for that (the same one that does iframes). Hopefully
> I'll commit that today.

Cool.

>
>> * how does the code treat invalid html with both body and frameset?
>
> TagSoup should clean up the invalid HTML.
>
> The issue you'd run into with <body><frameset> is that TagSoup maps it
> to an empty <body />, followed by <frameset>...</frameset>.
>
> I committed a patch that fixes this, at least for the examples that I
> tried (including the one that Julien reported).

Great, that was one example of invalid HTML from our parse-html tests.

>
>> * what's the status of extracting the meta robots and link rel
>> information?
>
> All <meta> elements are now emitted in the resulting <head> element.
>
> And <link> and <base> elements should be passed through.

Sounds great.

>
> It would be great to get input on just how "fixed" things are now, or
> maybe after the next patch gets committed.

We have a set of torture tests that we subjected parse-html to... ;)
We'll see how Tika fares now. Overall this sounds like great progress!

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
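
To illustrate the kind of information a crawler reads from the <meta>, <link> and <base> elements discussed above, here is a small sketch. Again, this is not Tika or Nutch code: it relies on Python's standard html.parser module, and HeadInfoCollector, its fields, and the sample markup are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class HeadInfoCollector(HTMLParser):
    """Gathers robots directives, the first <base href>, and <link> rel/href pairs."""

    def __init__(self):
        super().__init__()
        self.robots = set()    # lowercased directives from <meta name="robots">
        self.base_href = None  # first <base href="..."> seen, if any
        self.links = []        # (rel, href) pairs from <link> elements

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; values keep their original case.
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.robots.add(directive)
        elif tag == "base" and self.base_href is None and a.get("href"):
            self.base_href = a["href"]
        elif tag == "link" and a.get("href"):
            self.links.append(((a.get("rel") or "").lower(), a["href"]))


SAMPLE_HEAD = """
<html><head>
<base href="http://example.com/docs/">
<meta name="ROBOTS" content="NOINDEX, nofollow">
<meta name="description" content="ignored here">
<link rel="canonical" href="http://example.com/docs/page">
</head><body></body></html>
"""

collector = HeadInfoCollector()
collector.feed(SAMPLE_HEAD)
collector.close()
```

A crawler would typically skip indexing when "noindex" is in collector.robots, skip outlink extraction when "nofollow" is present, and resolve relative links against collector.base_href when it is set.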

Re: Tika HTML parsing

Posted by Ken Krugler <kk...@transpac.com>.

Hi Andrzej,

On Aug 15, 2010, at 12:04am, Andrzej Bialecki wrote:

> On 2010-08-15 06:54, Ken Krugler wrote:
>> For what it's worth, I just committed some patches to Tika that should
>> improve Tika's ability to extract HTML outlinks (in <img> and <frame>
>> elements, at least). Support for <iframe> should be coming soon :)
>>
>> This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
>> tracking down, but I think Tika is getting closer to being usable by
>> Nutch for typical web crawling.
>
> Thanks Ken for pushing forward this work! A few questions:
>
> * does this include image maps as well (<area>)?

I've got a patch for that (the same one that does iframes). Hopefully  
I'll commit that today.

> * how does the code treat invalid html with both body and frameset?

TagSoup should clean up the invalid HTML.

The issue you'd run into with <body><frameset> is that TagSoup maps it  
to an empty <body />, followed by <frameset>...</frameset>.

I committed a patch that fixes this, at least for the examples that I  
tried (including the one that Julien reported).

> * what's the status of extracting the meta robots and link rel  
> information?

All <meta> elements are now emitted in the resulting <head> element.

And <link> and <base> elements should be passed through.

It would be great to get input on just how "fixed" things are now, or  
maybe after the next patch gets committed.

Thanks,

-- Ken


Re: Tika HTML parsing

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-08-15 06:54, Ken Krugler wrote:
> For what it's worth, I just committed some patches to Tika that should
> improve Tika's ability to extract HTML outlinks (in <img> and <frame>
> elements, at least). Support for <iframe> should be coming soon :)
>
> This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
> tracking down, but I think Tika is getting closer to being usable by
> Nutch for typical web crawling.

Thanks Ken for pushing forward this work! A few questions:

* does this include image maps as well (<area>)?

* how does the code treat invalid html with both body and frameset?

* what's the status of extracting the meta robots and link rel information?
