Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/06/16 15:28:36 UTC

xml vs html parser

All,

  On govdocs1, the xml parser's exceptions accounted for nearly a quarter of all thrown exceptions at one point (Tika 1.7ish).  Typically, a file was mis-identified as xml when in fact it was sgml or some other text-based file with markup that wasn't meant to be xml.

  For kicks, I switched the config to use the HtmlParser for files identified as xml.  This got rid of the exceptions, but the content was quite different (ballpark 6k of 35k files had similarity &lt; 0.95), mostly because of elisions ("the quick" -> "thequick"), and I assume this happens across tags...

  So, is there a way to make the XMLParser more lenient?  Or is there a way to configure the HtmlParser to add spaces for non-html tags?

  Or, is there a better solution?



     Thank you!

              Best,

                 Tim


RE: xml vs html parser

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Jukka,
  Sorry for my delay.

addSpaceBetweenElements ... exactly what I was looking for.  Thank you.

  I'll send an update after further analysis of the incorrectly identified files to see if we can tweak our mimes.

      Cheers,

              Tim

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Tuesday, June 16, 2015 10:26 AM
To: Tika Users
Subject: Re: xml vs html parser

Hi,

2015-06-16 9:28 GMT-04:00 Allison, Timothy B. <ta...@mitre.org>:
> So, is there a way to make the XMLParser more lenient?

I don't think so. XML is draconian by design.

> Or is there a way to configure the HtmlParser to add spaces for
> non-html tags?

One option that wouldn't require changes in Tika code could be to use
HtmlParser with the IdentityHtmlMapper and process the output using
TextContentHandler with the addSpaceBetweenElements option enabled.
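[Editor's note: a minimal sketch of the wiring Jukka describes, assuming Tika 1.x package names; the class and method names here are hypothetical, but HtmlParser, IdentityHtmlMapper, TextContentHandler, and BodyContentHandler are real Tika classes.]

```java
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.TextContentHandler;

public class XmlAsHtmlExtractor {
    // Parse possibly-malformed "xml" with the lenient HtmlParser.
    // IdentityHtmlMapper passes non-HTML element names through
    // unchanged, and addSpaceBetweenElements=true pads element
    // boundaries so "<a>the</a><b>quick</b>" comes out as
    // "the quick" instead of "thequick".
    public static String extract(InputStream in) throws Exception {
        BodyContentHandler body = new BodyContentHandler(-1); // -1: no write limit
        TextContentHandler text = new TextContentHandler(body, true);
        ParseContext context = new ParseContext();
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
        new HtmlParser().parse(in, text, new Metadata(), context);
        return body.toString();
    }
}
```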

> Or, is there a better solution?

The cleanest alternative would be to come up with more accurate
detection heuristics for SGML.

Are there some common file name patterns, DOCTYPEs or other easily
identifiable bits that could be used to improve the accuracy of type
detection?

Things like the <?xml ...?> header, presence of xmlns attributes, the
.xml file extension, etc. can be used as highly reliable signals for
XML content, so the lack of them coupled with even some fairly weak
SGML detection signals (stuff like upper case element names?) might be
enough to get significant improvements in this area.
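[Editor's note: such a heuristic could be registered via a custom mime definition along the lines of the sketch below. The priority value, offset range, and match pattern are illustrative assumptions, not tested values; Tika loads files in this shared-mime-info-style format (e.g. a custom-mimetypes.xml on the classpath).]

```xml
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="text/sgml">
    <!-- hypothetical magic: a DOCTYPE near the start of the file;
         kept at a low priority so that stronger XML signals
         (the <?xml ...?> header, xmlns attributes) win first -->
    <magic priority="40">
      <match value="&lt;!DOCTYPE" type="string" offset="0:64"/>
    </magic>
    <glob pattern="*.sgm"/>
    <glob pattern="*.sgml"/>
  </mime-type>
</mime-info>
```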

BR,

Jukka Zitting
