Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/01/20 06:35:56 UTC
[jira] Updated: (TIKA-367) Mime type rootXML equality improvement

     [ https://issues.apache.org/jira/browse/TIKA-367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-367:
-----------------------------------

    Description: 
While working on TIKA-357 and TIKA-366, I noticed (and Ken did too) that XHTML detection was no longer working in his regression test, o.a.tika.parser.html.HtmlParserTest#testXhtmlParsing. The cause is the fix for TIKA-327. Because I added namespaceless html and link tags as valid root XML for the text/html mime type, text/html now matches the application/xhtml+xml example that Ken had previously included in that test. Phew. You still with me? OK, so if you are, it turns out that the failure comes from the rootXML matching rules being employed. The code boiled down to:

        boolean matches(String namespaceURI, String localName) {
            //Compare namespaces
            if (!isEmpty(this.namespaceURI)) {
                if (!this.namespaceURI.equals(namespaceURI)) {
                    return false;
                }
            }

            //Compare root element's local name
            if (!isEmpty(this.localName)) {
                if (!this.localName.equals(localName)) {
                    return false;
                }
            }

            return true;
        }

The issue is that this version of the #matches function is too lenient. When this.namespaceURI or this.localName is empty, the corresponding comparison is skipped entirely, so a root-XML rule with localName "html" and no namespace also matches a root element with localName "html" that does carry a namespace, and it can supersede the root-XML rule that declares that namespace. This isn't the behavior we want. To fix this, we should handle the case where this.namespaceURI or this.localName is empty (e.g., in an else block above) and return true only if the provided namespaceURI or localName, respectively, is empty as well.
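
For illustration only, here is a minimal sketch of the stricter comparison described above. It is not the committed patch: the class name RootXmlSketch, the helper fieldMatches, and the null-tolerant isEmpty are assumptions made up for this example, and the XHTML namespace URI is used only as sample input.

```java
/** Hypothetical stand-in for a root-XML rule; not Tika's actual class. */
final class RootXmlSketch {
    private final String namespaceURI;
    private final String localName;

    RootXmlSketch(String namespaceURI, String localName) {
        this.namespaceURI = namespaceURI;
        this.localName = localName;
    }

    /** Assumed helper: treats null and "" alike. */
    private static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /**
     * An empty rule value matches only an empty input value;
     * a non-empty rule value must equal the input exactly.
     */
    private static boolean fieldMatches(String expected, String actual) {
        if (isEmpty(expected)) {
            return isEmpty(actual);
        }
        return expected.equals(actual);
    }

    boolean matches(String namespaceURI, String localName) {
        return fieldMatches(this.namespaceURI, namespaceURI)
            && fieldMatches(this.localName, localName);
    }

    public static void main(String[] args) {
        String xhtmlNs = "http://www.w3.org/1999/xhtml";
        RootXmlSketch plainHtml = new RootXmlSketch(null, "html");
        RootXmlSketch xhtml = new RootXmlSketch(xhtmlNs, "html");

        // The namespaceless rule no longer claims a namespaced <html> root.
        System.out.println(plainHtml.matches(xhtmlNs, "html")); // false
        System.out.println(xhtml.matches(xhtmlNs, "html"));     // true
        System.out.println(plainHtml.matches("", "html"));      // true
    }
}
```

With this shape, the namespaceless "html" rule and the namespaced "html" rule are mutually exclusive, so neither can supersede the other for the same root element.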

  was:
While working on TIKA-357 and TIKA-366, I noticed (and Ken did too) that XHTML detection was no longer working in his regression test within o.a.tika.parser.html.HtmlParserTest#testXhtmlParsing. The cause of this has to do with the fix for TIKA-327. Because I used namespaceless html and link tags as valid root XML for the text/html mime type, text/html was now matching for the application/html+xml example that Ken had previously included in o.a.tika.parser.html.HtmlParserTest#testXhtmlParsing. Phew. You still with me? OK, so if you are, it turns out that the reason it failed was due to the rootXML matches rules that were being employed. The code boiled down to:

{code}
        boolean matches(String namespaceURI, String localName) {
            //Compare namespaces
            if (!isEmpty(this.namespaceURI)) {
                if (!this.namespaceURI.equals(namespaceURI)) {
                    return false;
                }
            }

            //Compare root element's local name
            if (!isEmpty(this.localName)) {
                if (!this.localName.equals(localName)) {
                    return false;
                }
            }

   return true
}
{code}

The issue with this block is that this version of the #matches function is too lenient. So lenient, to the point of declaring one root-XML match for a localName "html" with no namespace superseded another root-XML with a localName "html", and that included a namespace. This isn't the behavior we want. To alleviate this we should check if this.namespaceURI and this.localName are empty (e.g., put in an else block above) and make sure that if they are, the provided namespaceURI and localName are empty as well in order to return true.


> Mime type rootXML equality improvement
> --------------------------------------
>
>                 Key: TIKA-367
>                 URL: https://issues.apache.org/jira/browse/TIKA-367
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 0.5
>         Environment: My local MacBook pro
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 0.6
>
>
> While working on TIKA-357 and TIKA-366, I noticed (and Ken did too) that XHTML detection was no longer working in his regression test, o.a.tika.parser.html.HtmlParserTest#testXhtmlParsing. The cause is the fix for TIKA-327. Because I added namespaceless html and link tags as valid root XML for the text/html mime type, text/html now matches the application/xhtml+xml example that Ken had previously included in that test. Phew. You still with me? OK, so if you are, it turns out that the failure comes from the rootXML matching rules being employed. The code boiled down to:
>         boolean matches(String namespaceURI, String localName) {
>             //Compare namespaces
>             if (!isEmpty(this.namespaceURI)) {
>                 if (!this.namespaceURI.equals(namespaceURI)) {
>                     return false;
>                 }
>             }
>             //Compare root element's local name
>             if (!isEmpty(this.localName)) {
>                 if (!this.localName.equals(localName)) {
>                     return false;
>                 }
>             }
>             return true;
>         }
> The issue is that this version of the #matches function is too lenient. When this.namespaceURI or this.localName is empty, the corresponding comparison is skipped entirely, so a root-XML rule with localName "html" and no namespace also matches a root element with localName "html" that does carry a namespace, and it can supersede the root-XML rule that declares that namespace. This isn't the behavior we want. To fix this, we should handle the case where this.namespaceURI or this.localName is empty (e.g., in an else block above) and return true only if the provided namespaceURI or localName, respectively, is empty as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.