Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2010/07/07 14:55:36 UTC

Parse-tika ignores too much data...

Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food, i.e.
prepare the test DOMs with Tika's HtmlParser.

Results are not so good for some test cases... Even when using 
IdentityHtmlMapper Tika ignores some elements (such as frame/frameset) 
and for some others (area) it drops the href. As a result, the number of 
valid outlinks collected with parse-tika is much smaller than with 
parse-html.

I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), 
and a partial fix was applied to Tika 0.8, but still this won't handle 
the problems I mentioned above.

Can we come up with a plan to address this? I'd rather switch completely
to Tika's HTML parsing, but at the moment we would lose too much useful
data...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Parse-tika ignores too much data...

Posted by Ken Krugler <kk...@transpac.com>.

On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:

> On 2010-07-07 22:32, Ken Krugler wrote:
>> Hi Julien,
>>
>>> See https://issues.apache.org/jira/browse/TIKA-457 for a description
>>> of one of the cases found by Andrzej. There seems to be something
>>> very wrong with the way <body> is handled; we also saw cases where
>>> it appeared twice in the output.
>>
>> Don't know about the case of it appearing twice.
>>
>> But for the above issue, I added a comment. The test HTML is badly
>> broken, in that you can either have a <body> OR a <frameset>, but
>> not both.
>
> The HTML was broken on purpose: one of the goals of the original
> test was to extract as much content and as many links as possible in
> the presence of grave errors. As you know, even major sites often
> produce badly broken HTML, but parsers sanitize it and produce a
> valid DOM. In this case, though, the result had two nested <body>
> elements, which is not valid.

I'll need to check this out - the response from TagSoup was <body/>  
followed by the <frameset> data, and finally a closing </html>.

So if Tika is generating two bodies, then that's a bug in Tika. Though  
technically, having the <frameset> following the <body> is also invalid.

I'd suggest filing a Tika issue to do a better job of handling invalid  
framesets like this. Based on my experience, I don't think there would  
be an easy way to get this change into TagSoup.

> I should also mention that NekoHTML handled this test much better,  
> by removing the <body> and retaining only the <frameset>.

Yes, that's a well-known issue - certain docs are better handled by  
NekoHTML, while with others you get better results from TagSoup.

Anecdotally I'd heard that NekoHTML was better at extracting links.

Tika used to use NekoHTML, but switched to TagSoup last October. One  
reason was to avoid a troublesome dependency on Xerces.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Parse-tika ignores too much data...

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-07-07 22:32, Ken Krugler wrote:
> Hi Julien,
>
>> See https://issues.apache.org/jira/browse/TIKA-457 for a description
>> of one of the cases found by Andrzej. There seems to be something very
>> wrong with the way <body> is handled; we also saw cases where it
>> appeared twice in the output.
>
> Don't know about the case of it appearing twice.
>
> But for the above issue, I added a comment. The test HTML is badly
> broken, in that you can either have a <body> OR a <frameset>, but not both.

The HTML was broken on purpose: one of the goals of the original test
was to extract as much content and as many links as possible in the
presence of grave errors. As you know, even major sites often produce
badly broken HTML, but parsers sanitize it and produce a valid DOM. In
this case, though, the result had two nested <body> elements, which is
not valid. I should also mention that NekoHTML handled this test much
better, by removing the <body> and retaining only the <frameset>.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Parse-tika ignores too much data...

Posted by Julien Nioche <li...@gmail.com>.

Hi Ken,

Thank you for your comments and analysis. We should probably modify the
HTMLHandler so that it does not discard a frameset just because the
bodylevel is 0. I suggested earlier on the Tika list adding a mechanism
for specifying a custom handler via the Context; that would give us the
option in Nutch to implement the logic we want, i.e. ignore the body level
if we want to.
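
To make the body-level point concrete, here is a minimal sketch of the filtering idea, not Tika's actual HTMLHandler. It assumes a SAX handler that only keeps elements seen while inside <body>, plus a hypothetical switch (keepFramesOutsideBody) that lets frameset/frame through at body level 0. The class and method names are invented for illustration, and the JDK's own SAX parser is fed well-formed XHTML shaped like the TagSoup output described in this thread (an empty <body/> followed by the <frameset>).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: this is NOT Tika's HTMLHandler.
class BodyLevelFilter extends DefaultHandler {
    private final boolean keepFramesOutsideBody; // assumed switch, not a Tika option
    private int bodyLevel = 0;                   // > 0 while inside <body>
    final List<String> kept = new ArrayList<>();

    BodyLevelFilter(boolean keepFramesOutsideBody) {
        this.keepFramesOutsideBody = keepFramesOutsideBody;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            bodyLevel++;
            return;
        }
        boolean isFrame = "frameset".equals(qName) || "frame".equals(qName);
        // Keep elements inside <body>; optionally keep frames found outside it.
        if (bodyLevel > 0 || (keepFramesOutsideBody && isFrame)) {
            kept.add(qName);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("body".equals(qName)) {
            bodyLevel--;
        }
    }

    // Parses well-formed XHTML and returns the names of the elements kept.
    static List<String> run(String xhtml, boolean keepFrames) {
        try {
            BodyLevelFilter handler = new BodyLevelFilter(keepFrames);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.kept;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body/><frameset><frame src='a.html'/></frameset></html>";
        System.out.println(run(xhtml, false)); // frames dropped at body level 0
        System.out.println(run(xhtml, true));  // frames retained
    }
}
```

With the switch off, the frameset is lost because it arrives after <body> has already closed; with it on, both frameset and frame survive, which is the behaviour Nutch would want for outlink extraction.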

Thanks

J.

On 7 July 2010 21:32, Ken Krugler <kk...@transpac.com> wrote:

> Hi Julien,
>
> See https://issues.apache.org/jira/browse/TIKA-457 for a description of
> one of the cases found by Andrzej. There seems to be something very wrong
> with the way <body> is handled; we also saw cases where it appeared twice
> in the output.
>
>
> Don't know about the case of it appearing twice.
>
> But for the above issue, I added a comment. The test HTML is badly broken,
> in that you can either have a <body> OR a <frameset>, but not both.
>
> -- Ken

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: Parse-tika ignores too much data...

Posted by Ken Krugler <kk...@transpac.com>.

Hi Julien,

> See https://issues.apache.org/jira/browse/TIKA-457 for a description
> of one of the cases found by Andrzej. There seems to be something
> very wrong with the way <body> is handled; we also saw cases where it
> appeared twice in the output.

Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly  
broken, in that you can either have a <body> OR a <frameset>, but not  
both.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Parse-tika ignores too much data...

Posted by Julien Nioche <li...@gmail.com>.

Ken,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one
of the cases found by Andrzej. There seems to be something very wrong with
the way <body> is handled; we also saw cases where it appeared twice in the
output.

J.

On 7 July 2010 17:41, Ken Krugler <kk...@transpac.com> wrote:

> Hi Andrzej,
>
> I've got an old list of cases where Tika was not extracting links:
>
>  - frame
>  - iframe
>  - img
>  - map
>  - object
>  - link (only in <head> section)
>
> I worked around this in my crawling code, by directly processing the DOM,
> but I should roll this into Tika.
>
> If you have a list of problems with test docs, file a TIKA issue and I'll
> try to fix things up quickly.
>
> Thanks,
>
> -- Ken

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: Parse-tika ignores too much data...

Posted by Ken Krugler <kk...@transpac.com>.

Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

  - frame
  - iframe
  - img
  - map
  - object
  - link (only in <head> section)

I worked around this in my crawling code, by directly processing the  
DOM, but I should roll this into Tika.
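
For readers who want to see what a DOM-side workaround of this kind might look like, here is a minimal sketch. It is not the crawling code mentioned above. It assumes an already-built org.w3c.dom tree and a hypothetical element-to-attribute mapping (for example frame/src, img/src, object/data, link/href, and area/href for the links inside a map). The class and method names are invented, and the JDK's XML parser is used on well-formed XHTML only to obtain a DOM for the demo.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only: walks a DOM and collects link-bearing attributes.
class DomLinkCollector {
    // Assumed mapping from element name to the attribute that carries the URL.
    static final Map<String, String> LINK_ATTRS = Map.of(
        "a", "href",
        "area", "href",   // links inside <map>
        "link", "href",
        "frame", "src",
        "iframe", "src",
        "img", "src",
        "object", "data");

    // Depth-first walk in document order, appending each link found to 'out'.
    static List<String> collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
            if (attr != null && el.hasAttribute(attr)) {
                out.add(el.getAttribute(attr));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
        return out;
    }

    // Demo helper: builds a DOM from well-formed XHTML and collects its links.
    static List<String> collectFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            return collect(doc.getDocumentElement(), new ArrayList<>());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><link href='style.css'/></head>"
            + "<frameset><frame src='top.html'/><frame src='main.html'/></frameset></html>";
        System.out.println(collectFromXhtml(xhtml)); // [style.css, top.html, main.html]
    }
}
```

In a crawler the same walk would run over whatever DOM the HTML parser produced, so it only helps for elements and attributes the parser has not already dropped.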

If you have a list of problems with test docs, file a TIKA issue and  
I'll try to fix things up quickly.

Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:

> Hi,
>
> I'm going through NUTCH-840, and I tried to eat our own dog food,
> i.e. prepare the test DOMs with Tika's HtmlParser.
>
> Results are not so good for some test cases... Even when using
> IdentityHtmlMapper Tika ignores some elements (such as
> frame/frameset) and for some others (area) it drops the href. As a
> result, the number of valid outlinks collected with parse-tika is
> much smaller than with parse-html.
>
> I know this issue has been reported (TIKA-379, NUTCH-817,
> NUTCH-794), and a partial fix was applied to Tika 0.8, but still
> this won't handle the problems I mentioned above.
>
> Can we come up with a plan to address this? I'd rather switch
> completely to Tika's HTML parsing, but at the moment we would lose
> too much useful data...
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g