Posted to dev@nutch.apache.org by "Ferdy (JIRA)" <ji...@apache.org> on 2011/08/25 16:12:28 UTC

[jira] [Created] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html

application/xhtml+xml should be enabled in plugin.xml of parse-html
-------------------------------------------------------------------

                 Key: NUTCH-1097
                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3
            Reporter: Ferdy
            Priority: Minor


Since the configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1097:
---------------------------------

       Patch Info: Patch Available
    Fix Version/s: 1.4

Related mailing list thread: http://search-lucene.com/m/iM9eVUHUvh&subj=Parsing+only+common+file+types
                

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125903#comment-13125903 ] 

Lewis John McGibbney commented on NUTCH-1097:
---------------------------------------------

OK Ferdy, this is fine for me. The trunk patch compiles and passes all tests. It would be great to get this committed in 1.4. I am happy to act as assignee and commit if there are no further comments/suggestions. The same applies to the nutchgora branch.
                

[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Summary: application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml  (was: application/xhtml+xml should be enabled in plugin.xml of parse-html)


[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment: NUTCH-1097-v3.patch


[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment: NUTCH-1097-nutchgora_v1.patch

Renamed the patch to reflect the recent move of nutchgora to a branch.
                

[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Description: The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.  (was: Since the configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.)


[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment: NUTCH-1097-v4.patch
                NUTCH-1097-nutchgora_v2.patch

Thanks for looking into this, too.

The following patches (one for nutchgora and one for trunk/1.x) apply your suggestion. By the way, the nutchgora_v3 patch did not have the proper change for the plugin.xml; it was accidentally excluded. This is fixed now.

Also the change is properly documented in ParserFactory so that anyone scanning the code will notice why it is done this particular way.
                

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125367#comment-13125367 ] 

Lewis John McGibbney commented on NUTCH-1097:
---------------------------------------------

Does anyone else have input for this one? I think it is a valuable contribution and makes perfect sense. Conversely, it makes no sense for parse-html not to parse application/xhtml+xml files.
                

[jira] [Closed] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed NUTCH-1097.
---------------------------------------

    
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125916#comment-13125916 ] 

Andrzej Bialecki  commented on NUTCH-1097:
------------------------------------------

+1, the latest patch looks good.
                

[jira] [Issue Comment Edited] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989 ] 

Ferdy edited comment on NUTCH-1097 at 9/2/11 1:59 PM:
------------------------------------------------------

After digging into it for a while, I believe the best solution for now is to allow regexes in plugin.xml for the attribute contentType. This way multiple mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of the individual parser extensions (instead of simply using the asterisk wildcard, '*').

To keep backwards compatibility, I decided to escape '+' in the contentType attribute of extensions, because a lot of mimetypes contain this character. This will not break existing functionality. So you can use any regular expression supported by the standard Java Pattern class, except that the '+' character is always treated literally. The asterisk wildcard ('*') is still usable, because it is checked first in ParserFactory. (Otherwise an exception would occur, because '*' on its own is not a valid regex.)

To summarize, the latest patch (v3) contains two changes:
- ParserFactory matches contentType attribute of extensions using standard Java regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent with the default provided parse-plugins.xml.

I'm not arguing these changes should be committed as-is in the codebase, but I do believe the current situation is not flexible enough (especially the fact that many-to-one mappings of parse-plugins.xml cannot be supported by parser plugin.xml files). So if you have any suggestions or corrections, feel free to reply.
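
The matching rule described in this comment can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the patch: the class name ContentTypeMatchSketch and the method matches are hypothetical, and the only real API used is java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the names here are illustrative and not taken from ParserFactory.
final class ContentTypeMatchSketch {

    // Returns true if the contentType value declared in a plugin.xml
    // accepts the actual MIME type of a document.
    static boolean matches(String declared, String actual) {
        // The bare wildcard is checked first, because "*" alone is not a valid regex.
        if ("*".equals(declared)) {
            return true;
        }
        // Escape '+' so that types such as application/xhtml+xml are matched literally.
        String regex = declared.replace("+", "\\+");
        // Pattern.matches requires the whole MIME type to match the expression.
        return Pattern.matches(regex, actual);
    }

    public static void main(String[] args) {
        String declared = "text/html|application/xhtml+xml";
        System.out.println(matches(declared, "text/html"));             // true
        System.out.println(matches(declared, "application/xhtml+xml")); // true
        System.out.println(matches(declared, "application/pdf"));       // false
        System.out.println(matches("*", "application/pdf"));            // true
    }
}
```

Because only '+' is escaped, other regex metacharacters in the declared value (for example '.' or '|') keep their regex meaning, which is what allows the text/html|application/xhtml+xml alternation to work.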

  

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125925#comment-13125925 ] 

Markus Jelsma commented on NUTCH-1097:
--------------------------------------

+1, very useful  
                

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124761#comment-13124761 ] 

Ferdy commented on NUTCH-1097:
------------------------------

Hi,

As far as I know, parse-tika is currently used as a catch-all parser. So yes, if you were to exclude parse-html, Tika would just parse all HTML-related types. However, the fact that Nutch still has a separate HTML parser indicates that (for some reason) HTML should not be handled by Tika yet. Since application/xhtml+xml is also HTML, just with a less common mimetype, I believe it should be handled by the same parser that handles text/html. More importantly, I would still not be able to specify that only text/html and application/xhtml+xml should be parsed. ALL mimetypes will be parsed by the catch-all handler, since you can only specify the wildcard or ONE mimetype per parser.

The "text/html and application/xhtml+xml" use case is just a single example of why there should be a bit more flexibility for configuring parsers. If somebody wants to dispatch more than one mimetype (but not all) to a specific parser, that is just not possible. This is regardless of using Tika (or any other parser) as a catch-all parser.

Nevertheless my regex solution feels a bit hackish so it should be properly documented when committed.
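
To make the limitation concrete, here is a small sketch. It assumes the matching rule is exactly "the wildcard or one exact mimetype", as described in this comment; the class SingleTypeLimitationSketch and the method accepts are hypothetical names and do not come from Nutch.

```java
// Hypothetical sketch of the limitation described above; this is not Nutch code.
final class SingleTypeLimitationSketch {

    // Assumed rule: a parser declares either "*" or exactly one MIME type.
    static boolean accepts(String declared, String actual) {
        return "*".equals(declared) || declared.equals(actual);
    }

    public static void main(String[] args) {
        String[] types = {"text/html", "application/xhtml+xml", "application/pdf"};
        for (String type : types) {
            // A single exact type misses application/xhtml+xml,
            // while the wildcard also accepts application/pdf.
            System.out.println(type
                + " exact=" + accepts("text/html", type)
                + " wildcard=" + accepts("*", type));
        }
    }
}
```

Under this assumed rule, no single declared value accepts text/html and application/xhtml+xml while rejecting application/pdf.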
                

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127430#comment-13127430 ] 

Hudson commented on NUTCH-1097:
-------------------------------

Integrated in Nutch-trunk #1631 (See [https://builds.apache.org/job/Nutch-trunk/1631/])
    commit to address NUTCH-1097 and update to changes.txt

lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182506
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java
* /nutch/trunk/src/plugin/parse-html/plugin.xml

                

[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment: NUTCH-1097-v2.patch

Patch v1 results in a warning. This patch (v2) allows parse-html to accept all mimetypes. I'm not sure what the best approach is; I'm guessing Nutch will move to Tika parsing all the way soon.


[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment:     (was: NUTCH-1097-trunk_v1.patch)
    
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127376#comment-13127376 ] 

Hudson commented on NUTCH-1097:
-------------------------------

Integrated in Nutch-nutchgora #32 (See [https://builds.apache.org/job/Nutch-nutchgora/32/])
    commit to address NUTCH-1097 and update to changes.txt

lewismc : http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=rev&root=&revision=1182504
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserFactory.java
* /nutch/branches/nutchgora/src/plugin/parse-html/plugin.xml

                
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment: NUTCH-1097-v1.patch

> application/xhtml+xml should be enabled in plugin.xml of parse-html
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989 ] 

Ferdy commented on NUTCH-1097:
------------------------------

After digging into it for a while, I believe the best solution for now is to allow regexes in the contentType attribute of plugin.xml. This way, multiple mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of the individual parser extensions (instead of simply using the wildcard '*').

To keep backwards compatibility, I decided to escape '+' in the contentType attribute of extensions, because a lot of mimetypes contain this character. This will not break existing functionality. So you can use any regular expression supported by the standard Java Pattern, except that '+' is always treated as a literal character. The wildcard '*' is still usable, because it is checked first in ParserFactory. (Otherwise an exception would occur, because '*' is not a valid regex.)

To summarize, the latest patch (v3) contains two changes:
- ParserFactory matches contentType attribute of extensions using standard Java regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent with the default provided parse-plugins.xml.

I'm not arguing that these changes should be committed as-is, but I do believe the current situation is not flexible enough, especially since many-to-one mappings in parse-plugins.xml cannot be supported by parser plugin.xml files. If you have any suggestions or corrections, feel free to reply.
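
The matching rule described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class name ContentTypeMatchSketch and the helper matches() are hypothetical, and the real ParserFactory may be structured quite differently.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual ParserFactory implementation.
class ContentTypeMatchSketch {

    // declared: the contentType attribute from a plugin.xml extension
    // actual:   the content type of the document to be parsed
    static boolean matches(String declared, String actual) {
        // Check the wildcard first; "*" on its own is not a valid regex.
        if ("*".equals(declared)) {
            return true;
        }
        // Escape '+' so that it is matched literally, as in "xhtml+xml".
        String regex = declared.replace("+", "\\+");
        // Pattern.matches requires the whole input to match the regex.
        return Pattern.matches(regex, actual);
    }

    public static void main(String[] args) {
        String declared = "text/html|application/xhtml+xml";
        System.out.println(matches(declared, "text/html"));             // true
        System.out.println(matches(declared, "application/xhtml+xml")); // true
        System.out.println(matches(declared, "image/png"));             // false
        System.out.println(matches("*", "image/png"));                  // true
    }
}
```

Because only '+' is escaped in this sketch, the '|' alternation keeps its regex meaning, which is what lets a single contentType attribute list several mimetypes.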

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Updated] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy updated NUTCH-1097:
-------------------------

    Attachment: NUTCH-1097-trunk_v1.patch

The patch for Nutch trunk.

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-trunk_v1.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Assigned] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1097:
-------------------------------------------

    Assignee: Lewis John McGibbney
    
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Andrzej Bialecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414 ] 

Andrzej Bialecki  commented on NUTCH-1097:
------------------------------------------

+1, the idea makes sense. The patch looks good, but it needs a minor fix: mime types may also contain "." characters, e.g. "application/vnd.ms-word", and these need to be escaped too.
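
To illustrate why the "." matters, here is a small sketch. It assumes the contentType attribute is compiled as a Java regex after escaping; the class MimeEscapeSketch and the helper escapeMimeChars() are hypothetical names, not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the names here are not taken from the Nutch codebase.
class MimeEscapeSketch {

    // Escape '+' and '.' so both are matched literally;
    // other regex syntax such as '|' is left untouched.
    static String escapeMimeChars(String contentType) {
        return contentType.replace("+", "\\+").replace(".", "\\.");
    }

    public static void main(String[] args) {
        String declared = "application/vnd.ms-word";

        // Unescaped, '.' matches any single character, so a wrong type slips through.
        System.out.println(Pattern.matches(declared, "application/vndXms-word")); // true

        // With '.' escaped, only the literal type matches.
        String escaped = escapeMimeChars(declared);
        System.out.println(Pattern.matches(escaped, "application/vndXms-word")); // false
        System.out.println(Pattern.matches(escaped, "application/vnd.ms-word")); // true
    }
}
```

One consequence of escaping "." in this way is that constructs such as ".*" can no longer be written in the attribute; whether that trade-off is acceptable is a design decision for the patch.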
                
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Resolved] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1097.
-----------------------------------------

       Resolution: Fixed
    Fix Version/s: nutchgora

Committed @ revision 1182504 in nutchgora branch
Committed @ revision 1182506 in nutch trunk 1.4


                
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140120#comment-13140120 ] 

Hudson commented on NUTCH-1097:
-------------------------------

Integrated in nutch-trunk-maven #3 (See [https://builds.apache.org/job/nutch-trunk-maven/3/])
    commit to address NUTCH-1097 and update to changes.txt

lewismc : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1182506
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java
* /nutch/trunk/src/plugin/parse-html/plugin.xml

                
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124498#comment-13124498 ] 

Lewis John McGibbney commented on NUTCH-1097:
---------------------------------------------

Hi Ferdy, in general I think this looks OK, and I think you are correct that parse-html should also accept application/xhtml+xml. The question is whether this would be covered by parse-tika if, for example, parse-html were not included in plugin.includes.
From what I have read, I do not see what benefit this provides over calling parse-tika to deal with all application/xhtml+xml mimetypes. Please correct me where I am wrong. Thanks
                
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124502#comment-13124502 ] 

Lewis John McGibbney commented on NUTCH-1097:
---------------------------------------------

Having re-read the list thread and the full issue, I'm tempted towards +1 if you can clarify my thoughts above. I'll begin testing this and give feedback ASAP.
                
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Issue Comment Edited] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989 ] 

Ferdy edited comment on NUTCH-1097 at 9/2/11 2:04 PM:
------------------------------------------------------

After digging into it for a while, I believe the best solution for now is to allow regexes in the contentType attribute of plugin.xml. This way, multiple mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of the individual parser extensions (instead of simply using the wildcard 'asterisk').

To keep backwards compatibility, I decided to escape the 'plus' character in the contentType attribute of extensions, because a lot of mimetypes contain this character. This will not break existing functionality. So you can use any regular expression supported by the standard Java Pattern, except that the 'plus' character is always treated as a literal character. The wildcard 'asterisk' is still usable, because it is checked first in ParserFactory. (Otherwise an exception would occur, because 'asterisk' is not a valid regex.)

To summarize, the latest patch (v3) contains two changes:
- ParserFactory matches contentType attribute of extensions using standard Java regexes with escaped 'plus' character.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent with the default provided parse-plugins.xml.

I'm not arguing that these changes should be committed as-is, but I do believe the current situation is not flexible enough, especially since many-to-one mappings in parse-plugins.xml cannot be supported by parser plugin.xml files. If you have any suggestions or corrections, feel free to reply.

(Sorry for the edits. The plus/asterisk characters are messing up my layout.)

      was (Author: ferdy.g):
    After digging into it for a while, I believe the best solution for now is to allow regexes in the contentType attribute of plugin.xml. This way, multiple mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of the individual parser extensions (instead of simply using the wildcard 'asterisk').

To keep backwards compatibility, I decided to escape '+' in the contentType attribute of extensions, because a lot of mimetypes contain this character. This will not break existing functionality. So you can use any regular expression supported by the standard Java Pattern, except that '+' is always treated as a literal character. The wildcard 'asterisk' is still usable, because it is checked first in ParserFactory. (Otherwise an exception would occur, because 'asterisk' is not a valid regex.)

To summarize, the latest patch (v3) contains two changes:
- ParserFactory matches contentType attribute of extensions using standard Java regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent with the default provided parse-plugins.xml.

I'm not arguing that these changes should be committed as-is, but I do believe the current situation is not flexible enough, especially since many-to-one mappings in parse-plugins.xml cannot be supported by parser plugin.xml files. If you have any suggestions or corrections, feel free to reply.
  
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.


        

[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html

Posted by "Ferdy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093789#comment-13093789 ] 

Ferdy commented on NUTCH-1097:
------------------------------

It seems the current solution is still not complete, because INFO messages of the following type are now logged:

The parsing plugins: [org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes system property, and all claim to support the content type image/png, but they are not mapped to it  in the parse-plugins.xml file.

Anyone else's thoughts on this?
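
A rough sketch of how a wildcard contentType could lead to this kind of message is given below. It is a guess at the mechanism, under the assumption that a plugin "claims" a type when its declared contentType is '*' or equals the type, and that the mapping from parse-plugins.xml is consulted separately. The class WildcardClaimSketch, its methods, and the mapping contents are all hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the mismatch; not the actual Nutch logic.
class WildcardClaimSketch {

    // Assumed rule: a plugin claims a type if it declares '*' or exactly that type.
    static boolean claims(String declaredContentType, String contentType) {
        return "*".equals(declaredContentType) || declaredContentType.equals(contentType);
    }

    // True when the plugin claims the type but the mapping does not list the plugin for it.
    static boolean claimedButUnmapped(String pluginId, String declared,
                                      String contentType,
                                      Map<String, List<String>> mapping) {
        List<String> mapped = mapping.getOrDefault(contentType, List.of());
        return claims(declared, contentType) && !mapped.contains(pluginId);
    }

    public static void main(String[] args) {
        // Assumed mapping, loosely modeled on entries in parse-plugins.xml.
        Map<String, List<String>> mapping = Map.of(
            "text/html", List.of("parse-html"),
            "application/xhtml+xml", List.of("parse-html"));

        // With a '*' contentType, parse-html claims image/png although nothing maps it there.
        System.out.println(claimedButUnmapped("parse-html", "*", "image/png", mapping)); // true
        System.out.println(claimedButUnmapped("parse-html", "*", "text/html", mapping)); // false
    }
}
```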

> application/xhtml+xml should be enabled in plugin.xml of parse-html
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.
