Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/09/15 04:49:10 UTC

[Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by ChrisMattmann:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal

The comment on the change is:
Initial Draft of ParserFactoryImprovementProposal

New page:
= Parser Factory Improvement Proposal =


== Summary of Issue ==
Currently Nutch provides a plugin mechanism wherein plugins register certain metadata about themselves, including their id, classname, and so forth. In particular, parsing plugins register with a PluginRepository the contentTypes and file suffixes that they support.

One “adopted practice” in current Nutch parsing plugins (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has also been for each plugin to verify that the content type passed to it during a fetch is indeed one of the contentTypes that it supports (be it application/xml, application/pdf, etc.). This practice is cumbersome for a few reasons:

 *Any update to the supported content types of a parsing plugin requires recompiling the plugin code
 *Checking for “hard coded” content types within the parsing plugin duplicates information that already exists in the plugin’s descriptor file, plugin.xml
 *By the time content gets to a parsing plugin (i.e., the parsing plugin has been returned by the ParserFactory and handed content during a fetch), the ParserFactory should already have ensured that the appropriate plugin is being called for that contentType.
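
To make the “adopted practice” concrete, here is a minimal Java sketch of the kind of hard-coded check described above. The class name, method name, and content types are assumptions chosen for illustration; this is not code taken from parse-pdf or any other Nutch plugin.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not code from any actual Nutch plugin. */
final class HardCodedContentTypeCheck {
    // These values repeat what the plugin's plugin.xml descriptor already declares,
    // so supporting a new content type means editing and recompiling this class.
    private static final List<String> SUPPORTED_TYPES =
            Arrays.asList("application/pdf", "application/x-pdf");

    /** Returns true only for the content types compiled into this class. */
    static boolean supports(String contentType) {
        return contentType != null && SUPPORTED_TYPES.contains(contentType);
    }
}
```

The list inside the class and the list in the descriptor have to be kept in sync by hand, which is the duplication the second bullet refers to.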

A further problem is that several parsing plugins may support many of the same content types. For instance, the parse-js plugin may be the only well-suited parsing plugin for JavaScript, but it may also provide a good enough heuristic parser for plain text, and so it may register support for both types. However, there is already a parsing plugin, parse-text, whose primary purpose is to parse plain text, and nothing currently says which of the two should be preferred for that content type.

== Suggested Remedy ==
To ensure that the desired parsing plugin is called for each content type, and in effect to “kill two birds with one stone”, we propose a parsing plugin preference list for each content type that Nutch knows how to handle, i.e., each content type available via the mimeType system. During a fetch, once the mimeType of the content has been determined and the ParserFactory is tasked with returning a parsing plugin, the ParserFactory should consult the preference list for that contentType to determine which plugin has the highest preference, and return that plugin to the fetcher. If there is any problem using the returned parsing plugin for that contentType (i.e., if a ParseException is thrown during the parse, or a null ParseStatus is returned), then the ParserFactory should be called again, this time asking for the “next highest ranked” plugin for that contentType. This process should repeat until the parse is successful.
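
The fallback behaviour described above could be sketched roughly as follows. This is a minimal Java sketch under stated assumptions: `SketchParser` and `ParserFallbackSketch` are hypothetical stand-ins, not the actual Nutch Parser or ParserFactory interfaces, and a plain string stands in for the parse result. The sketch also assumes that the loop simply gives up once the preference list is exhausted, a case the proposal does not spell out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a parsing plugin; not the actual Nutch interface. */
interface SketchParser {
    /** Returns the extracted text, or null if this parser could not handle the content. */
    String parse(byte[] content) throws Exception;
}

final class ParserFallbackSketch {
    /**
     * Tries each parser in preference order and returns the first non-null result.
     * A thrown exception or a null result moves on to the next-ranked parser.
     */
    static Optional<String> parseWithFallback(List<SketchParser> rankedParsers, byte[] content) {
        for (SketchParser parser : rankedParsers) {
            try {
                String result = parser.parse(content);
                if (result != null) {
                    return Optional.of(result);
                }
            } catch (Exception e) {
                // This parser failed; fall through to the next-ranked one.
            }
        }
        return Optional.empty(); // assumption: give up when every ranked parser has failed
    }

    public static void main(String[] args) {
        SketchParser failing = c -> { throw new IllegalStateException("cannot parse"); };
        SketchParser unsure = c -> null;
        SketchParser plainText = c -> new String(c, StandardCharsets.UTF_8);
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        // The first two parsers are skipped and the third one succeeds.
        System.out.println(parseWithFallback(Arrays.asList(failing, unsure, plainText), content));
    }
}
```

Whether the retry loop lives in the fetcher or inside the ParserFactory is a design choice the proposal leaves open; the sketch keeps it in one helper for brevity.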

We propose that the “plugin preference list” should be a separate file that lives in $NUTCH_HOME/conf called “parse-plugins.xml”. The format of the file (full DTD to be developed during coding) should be something like: {{{

<parse-plugins>
  <default pluginname="parse-text"/>
  <fileType name="pdf">
   <mimeTypes>
    <mimeType name="application/pdf" />
    <mimeType name="application/x-pdf" />
    …
   </mimeTypes>

   <plugins>

      <plugin name="parse-pdf" order="1"/>
      <plugin name="parse-pdf-worse" order="2"/>
     …
   </plugins>
  </fileType>
    …
</parse-plugins>

}}}
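
As a rough illustration of how such a file might be read, the following Java sketch loads the draft layout above with the JDK's DOM parser and builds a map from each mimeType name to its plugin names, sorted by the order attribute. The class and method names are hypothetical, the element and attribute names are taken from the draft format (which may change once a DTD exists), and the `<default>` element is ignored here for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the draft parse-plugins.xml layout. */
final class ParsePluginsReaderSketch {
    /** Maps each mimeType name to its plugin names, sorted by ascending "order". */
    static Map<String, List<String>> read(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse parse-plugins XML", e);
        }
        Map<String, List<String>> preferences = new LinkedHashMap<>();
        for (Element fileType : elements(doc.getElementsByTagName("fileType"))) {
            // Sort this fileType's plugins by their numeric order attribute.
            List<Element> plugins = elements(fileType.getElementsByTagName("plugin"));
            plugins.sort(Comparator.comparingInt(p -> Integer.parseInt(p.getAttribute("order"))));
            List<String> names = new ArrayList<>();
            for (Element plugin : plugins) {
                names.add(plugin.getAttribute("name"));
            }
            // Every mimeType listed under the fileType shares the same preference list.
            for (Element mimeType : elements(fileType.getElementsByTagName("mimeType"))) {
                preferences.put(mimeType.getAttribute("name"), names);
            }
        }
        return preferences;
    }

    /** Copies a DOM NodeList of elements into a mutable Java list. */
    private static List<Element> elements(NodeList nodes) {
        List<Element> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add((Element) nodes.item(i));
        }
        return out;
    }
}
```

Given a file with the two PDF entries shown above, the list returned for application/pdf would be parse-pdf followed by parse-pdf-worse, regardless of the order in which the two plugin elements appear in the file.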


One of the main impacts of having a file like parse-plugins.xml is that the pathSuffix="" attribute should no longer be part of the plugin.xml descriptor. We propose moving it out of plugin.xml and into the mime-types.xml file.

== Architectural Impact ==

=== Components ===
 *Fetcher
 *PluginSystem
 *ParserFactory

=== Impact on current releases of Nutch ===

''Incompatibilities''

Moving the contentType and pathSuffix out of the plugin.xml file would create an updated version of the plugin.xml descriptor schema for each plugin. To lessen the effect on previous and near-term releases of Nutch, these attributes could be left as optional in the plugin.xml schema but marked as “deprecated”. That would let people know that they are no longer part of the parse plugin identification process, while avoiding incompatibilities with the plugin.xml files that people have already written. Ultimately, however, we propose that future releases of Nutch remove the contentType and pathSuffix attributes from the plugin.xml schema.

Other than the plugin.xml file schema change, this capability addition will simply control the order in which parsing plugins get called during fetching activities. It won’t directly impact the segments stored, or the webapp, or any of the main components of Nutch.

''Issues''

 *The proposed new capabilities should first be tested on local systems and, if successful, uploaded to JIRA and verified against the latest SVN revision.
 *Unit tests should be written to verify the appropriate plugin parsing order.
 *Users will need to be told, in the Nutch tutorial and instruction lists, how to set up the parsing plugin preferences prior to performing a fetch.

== Personnel ==

 *Jerome Charron
 *Sébastien Le Callonnec
 *Chris A. Mattmann

== Timeframe ==

 *Begin work the weekend of 9/9
 *Complete first prototype patches to JIRA by end of week, 9/18
 *Test against latest SVNs of Nutch, by 9/25
 *Delivery of operational capability, by 10/1

== Affected files ==
 *PluginRepository.java
 *PluginManifestParser.java
 *ParserFactory.java
 *plugin.xml descriptor files
 *files in package {{{org.apache.nutch.util.mime}}}

Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Posted by og...@yahoo.com.
Sounds good to me.

Otis

--- Chris Mattmann <ch...@jpl.nasa.gov> wrote:

> Hi Otis,
> 
> Point taken. Since both convey the same information, I think it's okay
> to support both, but by default we could write the initial plugin
> entries in parse-plugins.xml without the "order=" attribute. Fair enough?
> 
> Cheers,
>   Chris


Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Otis,

Point taken. Since both convey the same information, I think it's okay to
support both, but by default we could write the initial plugin entries in
parse-plugins.xml without the "order=" attribute. Fair enough?

Cheers,
  Chris
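
One way to read "support both" is sketched below in Java. It relies on the DOM guarantee that getElementsByTagName returns elements in document order (a preorder traversal), so position in the file is a usable default. The class name is hypothetical, the `<mimeType>`/`<plugin id=...>` layout follows Otis's example, and the rule used for mixing the two styles (an explicit order value wins, otherwise the 1-based document position is the rank) is an assumption rather than something agreed in this thread.

```java
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: ranks plugins by "order" when present, else by position. */
final class PluginOrderSketch {
    /** Returns the plugin ids under one mimeType element, most preferred first. */
    static List<String> orderedPluginIds(String mimeTypeXml) {
        NodeList nodes;
        try {
            nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(mimeTypeXml)))
                    .getElementsByTagName("plugin"); // returned in document order
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<Map.Entry<Integer, String>> ranked = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element plugin = (Element) nodes.item(i);
            String order = plugin.getAttribute("order"); // empty string when absent
            // Assumption: fall back to the 1-based document position when order is missing.
            int rank = order.isEmpty() ? i + 1 : Integer.parseInt(order);
            ranked.add(new SimpleEntry<Integer, String>(rank, plugin.getAttribute("id")));
        }
        // List.sort is stable, so plugins with equal ranks keep their document order.
        ranked.sort(Map.Entry.comparingByKey());
        List<String> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : ranked) {
            ids.add(entry.getValue());
        }
        return ids;
    }
}
```

Because the sort is stable, two plugins that were accidentally given the same order value simply keep the order in which they appear in the file.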



On 9/15/05 3:23 PM, "ogjunk-nutch@yahoo.com" <og...@yahoo.com> wrote:

> Well, you have to tell users about order="N" somewhere in the docs.
> Instead of telling them about order="N", tell them that the order in
> XML matters.  Either case requires education, and the latter one
> requires less typing and avoids the case described in the proposal.
> 
> Otis
> 

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 




Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Posted by og...@yahoo.com.
Well, you have to tell users about order="N" somewhere in the docs. 
Instead of telling them about order="N", tell them that the order in
XML matters.  Either case requires education, and the latter one
requires less typing and avoids the case described in the proposal.

Otis

--- Sébastien LE CALLONNEC <sl...@yahoo.ie> wrote:

> Hi Otis,
> 
> This issue arose during our discussion of this proposal, and my
> feeling was that the XML specification doesn't state that element
> order is significant in an XML file.  I therefore read the spec
> again, and indeed didn't find anything on that subject...
> 
> I think it is reasonable to consider that a parser _might_ return
> the elements in a different order, though, as I mentioned to
> Chris & Jerome, that would be quite unheard of and, to be honest,
> rather irritating.
> 
> What do you think?
> 
> Regards,
> Sebastien.
> 
> 
> > Quick comment about order="N" and the paragraph that describes how
> > to deal with cases where people mess things up and enter multiple
> > plugins for the same content type and the same order:
> > 
> > - Why is the order attribute even needed?  It looks like a redundant
> > piece of information - why not derive order from the order of plugin
> > definitions in the XML file?
> > 
> > For instance:
> > Instead of this:
> > 
> >   <mimeType name="*">
> >       <plugin id="parse-text" order="1"/>
> >       <plugin id="another-one-default-parser" order="2"/>
> >      ....
> >   </mimeType>
> > 
> > We have this:
> > 
> >   <mimeType name="*">
> >       <plugin id="parse-text"/>
> >       <plugin id="another-one-default-parser"/>
> >      ....
> >   </mimeType>
> > 
> > parse-text first, another-one-default-parser second.  Less typing,
> > and we avoid the case of equal ordering altogether.
> > 
> > Otis
> > > 
> > > Other than the plugin.xml file schema change, this capability
> > > addition will simply control the order in which parsing plugins
> get
> > > called during fetching activities. It won’t directly impact the
> > > segments stored, or the webapp, or any of the main components of
> > > Nutch.
> > > 
> > > ''Issues''
> > > 
> > > The proposed new capabilities should be first tested on local
> > > systems, and if successful, uploaded to JIRA, and verified
> against
> > > the latest SVNs.
> > > Unit tests should be written to verify appropriate plugin parsing
> > > order.
> > > Users will need to be notified in the Nutch tutorial and
> > instruction
> > > lists about how to set up the parsing plugin preferences prior to
> > > performing a fetch.
> > > 
> > > == Personnel ==
> > > 
> > >  *Jerome Charron
> > >  *Sébastien Le Callonnec
> > >  *Chris A. Mattmann
> > > 
> > > == Timeframe ==
> > > 
> > 


Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi Otis,


This issue arose during our discussion for this proposal, and my
feeling was that the XML specification doesn't state that the order is
significant in an XML file.  I therefore read the spec again, and
indeed didn't find anything on that subject...

I think it is somewhat reasonable to consider that a parser _might_
return the elements in a different order, though, as I mentioned to
Chris & Jerome, that would be quite unheard of and, to be honest,
rather irritating.
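
For reference, whatever the XML specification itself says, the W3C DOM
API does define the list returned by getElementsByTagName as being in
document order, so a DOM-based reader of the preference list would see
the plugins in the order they were written. Below is a minimal sketch,
assuming the list is read with the JDK's DOM parser; the
DocumentOrderDemo class and the pluginIds method are hypothetical names
for illustration, not Nutch code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Nutch.
class DocumentOrderDemo {

    /** Returns the id attribute of every <plugin> element, in document order. */
    static List<String> pluginIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // DOM defines this NodeList as a preorder (document-order) traversal.
            NodeList plugins = doc.getElementsByTagName("plugin");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < plugins.getLength(); i++) {
                ids.add(((Element) plugins.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("could not parse preference list", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mimeType name=\"*\">"
                + "<plugin id=\"parse-text\"/>"
                + "<plugin id=\"another-one-default-parser\"/>"
                + "</mimeType>";
        // prints [parse-text, another-one-default-parser]
        System.out.println(pluginIds(xml));
    }
}
```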

What do you think?


Regards,
Sebastien.



> Quick comment about order="N" and the paragraph that describes how to
> deal with cases where people mess things up and enter multiple plugins
> for the same content type and the same order:
> 
> - Why is the order attribute even needed?  It looks like a redundant
> piece of information - why not derive order from the order of plugin
> definitions in the XML file?
> 
> For instance:
> Instead of this:
> 
>   <mimeType name="*">
>       <plugin id="parse-text" order="1"/>
>       <plugin id="another-one-default-parser" order="2"/>
>      ....
>   </mimeType>
> 
> We have this:
> 
>   <mimeType name="*">
>       <plugin id="parse-text"/>
>       <plugin id="another-one-default-parser"/>
>      ....
>   </mimeType>
> 
> parse-text first, another-one-default-parser second.  Less typing, and
> we avoid the case of equal ordering altogether.
> 
> Otis
> 
> 

Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Otis,



On 9/15/05 10:14 AM, "ogjunk-nutch@yahoo.com" <og...@yahoo.com>
wrote:

> Quick comment about order="N" and the paragraph that describes how to
> deal with cases where people mess things up and enter multiple plugins
> for the same content type and the same order:
> 
> - Why is the order attribute even needed?  It looks like a redundant
> piece of information - why not derive order from the order of plugin
> definitions in the XML file?

Well, yes and no. Having the "order=" attribute explicitly forces the user
to decide up front what the order should be. If the order is instead derived
implicitly from the order in which the tags are parsed, that decision is
hidden from the user, and you have to assume they know that order matters.

However, in the interest of time and brevity, we decided that it would be
nice to support both options: if they specify an order, we accept it;
if they don't, the ordering you suggest is applied.

We think that such a solution accommodates both types of users: those who
like to explicitly spell out options in their XML, and those who prefer the
shorthand of just listing the plugin elements in order.
We are open to suggestions on this though...
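
To make the "support both" idea concrete, here is a rough sketch of one
way the two rules could be combined. The policy for mixing them is an
assumption made for illustration and is not spelled out in the proposal:
entries with an explicit order come first, sorted ascending, and entries
without one follow in the order they were listed. PluginOrdering, Entry
and resolve are hypothetical names, not existing Nutch classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; class and field names are not from Nutch.
class PluginOrdering {

    /** One <plugin> entry: its id and the optional order attribute (null if absent). */
    static final class Entry {
        final String id;
        final Integer order;

        Entry(String id, Integer order) {
            this.id = id;
            this.order = order;
        }
    }

    /**
     * Returns plugin ids in preference order. Assumed policy: entries with an
     * explicit order come first, sorted ascending; entries without one follow
     * in the order they were listed. Entries with equal keys keep listed order.
     */
    static List<String> resolve(List<Entry> listed) {
        List<Entry> sorted = new ArrayList<>(listed);
        // List.sort is stable, so equal keys keep their document order.
        sorted.sort(Comparator.comparingInt(
                (Entry e) -> e.order != null ? e.order : Integer.MAX_VALUE));
        List<String> ids = new ArrayList<>();
        for (Entry e : sorted) {
            ids.add(e.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Entry> listed = Arrays.asList(
                new Entry("another-one-default-parser", null),
                new Entry("parse-text", 1));
        // prints [parse-text, another-one-default-parser]
        System.out.println(resolve(listed));
    }
}
```

With no order attributes at all, this reduces to plain document order,
which is the behaviour Otis suggested.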

Thanks for your comments Otis, and for reading the proposal!

Cheers,
  Chris

> 
> For instance:
> Instead of this:
> 
>   <mimeType name="*">
>       <plugin id="parse-text" order="1"/>
>       <plugin id="another-one-default-parser" order="2"/>
>      ....
>   </mimeType>
> 
> We have this:
> 
>   <mimeType name="*">
>       <plugin id="parse-text"/>
>       <plugin id="another-one-default-parser"/>
>      ....
>   </mimeType>
> 
> parse-text first, another-one-default-parser second.  Less typing, and
> we avoid the case of equal ordering altogether.
> 
> Otis
> 
> 

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 




Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

Posted by og...@yahoo.com.
Quick comment about order="N" and the paragraph that describes how to
deal with cases where people mess things up and enter multiple plugins
for the same content type and the same order:

- Why is the order attribute even needed?  It looks like a redundant
piece of information - why not derive order from the order of plugin
definitions in the XML file?

For instance:
Instead of this:

  <mimeType name="*">
      <plugin id="parse-text" order="1"/>
      <plugin id="another-one-default-parser" order="2"/>
     ....
  </mimeType>

We have this:

  <mimeType name="*">
      <plugin id="parse-text"/>
      <plugin id="another-one-default-parser"/>
     ....
  </mimeType>

parse-text first, another-one-default-parser second.  Less typing, and
we avoid the case of equal ordering altogether.
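
If the order attribute is kept, the "equal ordering" case could at least
be detected when the file is loaded. The following is only a sketch
under that assumption; OrderValidator and duplicateOrders are
hypothetical names, and the parallel arrays stand in for whatever
structure the real loader would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Nutch.
class OrderValidator {

    /**
     * Groups plugin ids by their order value and returns only the groups
     * where two or more plugins share the same order.
     * ids[i] is the plugin id and orders[i] its order attribute.
     */
    static Map<Integer, List<String>> duplicateOrders(String[] ids, int[] orders) {
        Map<Integer, List<String>> byOrder = new TreeMap<>();
        for (int i = 0; i < ids.length; i++) {
            byOrder.computeIfAbsent(orders[i], k -> new ArrayList<>()).add(ids[i]);
        }
        // Keep only the order values claimed by more than one plugin.
        byOrder.values().removeIf(group -> group.size() < 2);
        return byOrder;
    }

    public static void main(String[] args) {
        String[] ids = {"parse-text", "another-one-default-parser"};
        int[] orders = {1, 1};
        // prints {1=[parse-text, another-one-default-parser]}
        System.out.println(duplicateOrders(ids, orders));
    }
}
```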

Otis

