Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2005/12/14 05:10:46 UTC

[jira] Created: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping
----------------------------------------------------------------------------------------

         Key: NUTCH-140
         URL: http://issues.apache.org/jira/browse/NUTCH-140
     Project: Nutch
        Type: Improvement
  Components: fetcher  
 Environment:  Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment
    Reporter: Chris A. Mattmann
 Assigned to: Chris A. Mattmann 
    Priority: Minor


 Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType->list of pluginIds rather than mimeType->list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would seemingly fix this problem.

  We propose to have the concept of "aliases" in the parse-plugins.xml file, defined at the end of the file, something like:

 <parse-plugins>
    ....

   <mimeType name="text/html">
      <plugin id="parse-html"/>
   </mimeType>

    .....
  
   <aliases>
   <alias name="parse-html" extension-point="org.apache.nutch.parse.html.HtmlParser"/>

   ....
   <alias name="parse-html2" extension-point="my.other.html.Parser"/>
   
   ....
   </aliases>
</parse-plugins>



What do you guys think? This approach would be flexible enough to allow the mapping of extensionIds to mimeTypes, but without impacting the current "pluginId" concept.
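
To make the proposal concrete, here is a minimal Java sketch of how the aliases could be resolved when the file is read. It is an illustration only, not the patch: the class AliasSketch, its method names, and the trimmed-down sample document are hypothetical, and the attribute name "extension-point" is simply copied from the example above. Ids that have no matching alias are skipped in this sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; none of these names come from Nutch itself.
class AliasSketch {

  // Trimmed-down sample in the shape of the proposed parse-plugins.xml.
  static final String SAMPLE =
      "<parse-plugins>"
      + "<mimeType name=\"text/html\"><plugin id=\"parse-html\"/></mimeType>"
      + "<aliases>"
      + "<alias name=\"parse-html\""
      + " extension-point=\"org.apache.nutch.parse.html.HtmlParser\"/>"
      + "<alias name=\"parse-html2\" extension-point=\"my.other.html.Parser\"/>"
      + "</aliases>"
      + "</parse-plugins>";

  // Collects every <alias> element into an alias name -> extension id map.
  static Map<String, String> readAliases(Document doc) {
    Map<String, String> aliases = new HashMap<>();
    NodeList nodes = doc.getElementsByTagName("alias");
    for (int i = 0; i < nodes.getLength(); i++) {
      Element alias = (Element) nodes.item(i);
      aliases.put(alias.getAttribute("name"), alias.getAttribute("extension-point"));
    }
    return aliases;
  }

  // Returns the extension ids configured for a mime type, in file order.
  static List<String> extensionIdsFor(String xml, String mimeType) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse configuration", e);
    }
    Map<String, String> aliases = readAliases(doc);
    List<String> result = new ArrayList<>();
    NodeList types = doc.getElementsByTagName("mimeType");
    for (int i = 0; i < types.getLength(); i++) {
      Element type = (Element) types.item(i);
      if (!mimeType.equals(type.getAttribute("name"))) {
        continue;
      }
      NodeList plugins = type.getElementsByTagName("plugin");
      for (int j = 0; j < plugins.getLength(); j++) {
        String pluginId = ((Element) plugins.item(j)).getAttribute("id");
        String extensionId = aliases.get(pluginId);
        if (extensionId != null) {
          result.add(extensionId);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [org.apache.nutch.parse.html.HtmlParser]
    System.out.println(extensionIdsFor(SAMPLE, "text/html"));
  }
}
```

The mimeType entries keep using short ids such as "parse-html"; only the aliases section knows the longer extension ids.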

Comments welcome. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Closed: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-140?page=all ]
     
Jerome Charron closed NUTCH-140:
--------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

I have committed the patch provided by Chris with some modifications:
(http://svn.apache.org/viewcvs.cgi?rev=379403&view=rev)

* Some minor code reformatting
* An extension id can be used directly in the parse-plugins.xml file without any alias definition (will help in a transitional phase when we get an admin GUI)
* The API provides the ability to retrieve a parser from its extension-id or its alias (getParserByExtensionId)
* Remove the deprecated methods.
* Make use of the new APIs in parse-mp3 and parse-rtf
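
The idea that an id in the file may be either an alias or a raw extension id can be sketched as a lookup with a fallback. This is a hypothetical illustration, not the committed code: ParserLookupSketch, resolve and getParser are made-up names (the API method mentioned above is getParserByExtensionId, whose signature is not shown in this thread), and parsers are stood in for by plain Objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the Nutch implementation.
class ParserLookupSketch {

  private final Map<String, String> aliases = new HashMap<>();
  private final Map<String, Object> parsersByExtensionId = new HashMap<>();

  void addAlias(String alias, String extensionId) {
    aliases.put(alias, extensionId);
  }

  void register(String extensionId, Object parser) {
    parsersByExtensionId.put(extensionId, parser);
  }

  // An id with an alias definition maps to its extension id;
  // anything else is treated as an extension id already.
  String resolve(String idOrAlias) {
    String extensionId = aliases.get(idOrAlias);
    return extensionId != null ? extensionId : idOrAlias;
  }

  // Returns the registered parser, or null if none matches.
  Object getParser(String idOrAlias) {
    return parsersByExtensionId.get(resolve(idOrAlias));
  }
}
```

With this shape, a file that lists "org.apache.nutch.parse.html.HtmlParser" directly and a file that lists the alias "parse-html" end up at the same parser.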

Thanks Chris





[jira] Updated: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-140?page=all ]

Chris A. Mattmann updated NUTCH-140:
------------------------------------

    Attachment: NUTCH-140.20051502.patch.txt

An initial patch for NUTCH-140 for everyone's review.




[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360643 ] 

Chris A. Mattmann commented on NUTCH-140:
-----------------------------------------

Hey Stefan,

  Mainly, it would be to make them more human readable. Also, if I go in there and define all the aliases for the parsing plugin extensionIds that currently exist, there will be little tailoring left for the user to do out of the box (similar to what I already did for parse-plugins.xml, which has most of the mimeTypes in the system in there out of the box). In my opinion (and of course, it's just my opinion, so take it with a grain of salt), it's easier to look at pluginIds such as "parse-html" than at "org.apache.nutch.parse.html.HtmlParser", or something like that. It's a lot fewer characters to type, too ;) Another advantage is that it wouldn't change the way the system currently works, i.e., there would be no direct impact on users who are already used to mimeType->list of pluginIds in the parse-plugins.xml file.

Just my two cents.

Take care!

Cheers,
  Chris




[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12366376 ] 

Chris A. Mattmann commented on NUTCH-140:
-----------------------------------------

Hi Folks,

 I've gone ahead and created an initial patch for this issue. I'll be attaching it to JIRA within the next day for review.

Thanks!

Cheers,
  Chris





[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360409 ] 

Stefan Groschupf commented on NUTCH-140:
----------------------------------------

From my point of view this makes things more complicated. Why not just use the extension id? What would be the advantage of aliases?
Maybe the aliases would be more human readable, but in the end you have to define the aliases anyway and need to look up the extension ids. So I think it is just one more step, but maybe I am missing the advantage.


