Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2005/10/19 23:36:45 UTC

[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

    [ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332514 ] 

Doug Cutting commented on NUTCH-88:
-----------------------------------

I am seeing some problems using this.

First, the ParserFactory sometimes uses LOG.severe() which causes the Fetcher to exit.  Is there a reason this cannot be LOG.warning()?  LOG.severe() should only be used if you intend the application to exit.  This configuration problem does not seem to warrant that.  And I'm getting it with the default settings when an application/pdf is encountered.

The second problem I'm seeing is that most html pages are parsed by the ParseText parser.  I think this is because their HTTP content-type header is "text/html; charset=ISO-8859-1", which does not match "text/html".  Where should the content-type parameters be removed?
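
One way the parameters could be removed before the lookup is sketched below. This is only an illustration, not code from the Nutch trunk: the class name ContentTypeCleaner and the method stripParameters are hypothetical, and it assumes the factory compares bare, lower-cased media types.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: reduces an HTTP Content-Type
// header value to its bare media type before it is matched to a parser.
class ContentTypeCleaner {

    /** Returns the media type without parameters, or null for null input. */
    public static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Everything after the first ';' is a parameter such as charset.
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        // Media types are case-insensitive, so normalize the case as well.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints "text/html"
        System.out.println(stripParameters("text/html; charset=ISO-8859-1"));
    }
}
```

Doing this once, at the point where the content type is handed to the factory, would keep the individual parsers from each having to repeat it.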


> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>     Assignee: Jerome Charron
>      Fix For: 0.8-dev

>
> The ParserFactory chooses the Parser plugin to use based on the content-types and path-suffix defined in the parsers' plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" attribute matches the beginning of the content's type is used. 
> If none match, then the first whose "pathSuffix" attribute matches the end of the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the empty string is used.
> This policy causes problems when no match is found, because an arbitrary parser is used (and there is a good chance this parser can't handle the content).
> On the other hand, the content-type associated with a parser plugin is specified in the plugin.xml of each plugin (this is the value used by the ParserFactory), AND the code of each parser itself checks whether the content-type is ok (it uses a hard-coded content-type value rather than the value specified in plugin.xml => possibility of mismatches between the hard-coded content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
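
The three-step policy described in the issue can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual ParserFactory code: ParserEntry and select are hypothetical names, the entries stand in for the attributes read from each plugin.xml, and empty attribute values are assumed to be skipped in the first two steps.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the selection policy described in NUTCH-88.
class LegacyParserSelection {

    /** Stand-in for the attributes a parser declares in its plugin.xml. */
    static final class ParserEntry {
        final String id;
        final String contentType;
        final String pathSuffix;

        ParserEntry(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Returns the id of the chosen parser, or null if nothing applies. */
    public static String select(List<ParserEntry> parsers,
                                String contentType, String urlPath) {
        // Step 1: first plugin whose contentType is a prefix of the content's type.
        for (ParserEntry p : parsers) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return p.id;
            }
        }
        // Step 2: first plugin whose pathSuffix matches the end of the URL path.
        for (ParserEntry p : parsers) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return p.id;
            }
        }
        // Step 3: first plugin whose pathSuffix is the empty string.
        for (ParserEntry p : parsers) {
            if (p.pathSuffix.isEmpty()) {
                return p.id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<ParserEntry> parsers = Arrays.asList(
                new ParserEntry("parse-text", "text/plain", ""),
                new ParserEntry("parse-html", "text/html", "html"));
        // A PDF falls through to step 3 and lands on an unrelated parser.
        // prints "parse-text"
        System.out.println(select(parsers, "application/pdf", "/docs/report.pdf"));
    }
}
```

The example in main shows the failure mode the issue complains about: content that no plugin claims still gets handed to whichever parser declares an empty path suffix.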

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Re: [jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

Posted by Doug Cutting <cu...@nutch.org>.
Chris Mattmann wrote:
> I guess this is really a design issue in Nutch. Is there really any reason
> that the rest of the parsing plugins aren't enabled by default?

They used to all be enabled by default, but some were unreliable and 
would cause fetching to either run very slowly or to hang.  By forcing 
folks to enable these one-by-one Nutch is more predictable and reliable. 
  If folks find that a particular plugin is (a) generally useful, (b) 
fast, and (c) reliable, then please petition to have it added to the 
defaults.

Doug


RE: [jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Doug,

 I just noticed this comment from your original email:

> First, the ParserFactory sometimes uses LOG.severe() which causes the
> Fetcher to exit.  Is there a reason this cannot be LOG.warning()?
> LOG.severe() should only be used if you intend the application to exit.
> This configuration problem does not seem to warrant that.  And I'm getting
> it with the default settings when an application/pdf is encountered.

In fact, I can't speak for Jerome and Sebastien, but I actually intended the
application to exit in this case. Here is a snippet, taken from:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

///////////////////////////////////////////////////////////////////////////
If an activated parse plugin is not listed in the parse-plugins.xml, then it
won't get called for parsing. The purpose of the parse-plugins.xml file
would be to map parsing-plugin to contentType. Therefore, if an activated
plugin is not mapped to a content type, then it is "activated", but won't
get called. This is very similar to Apache HTTPD. See below:

# httpd.conf example
# add handler for php

LoadModule php4_module        libexec/httpd/libphp4.so

# map handler to mimeType
AddType application/x-httpd-php .php
AddType application/x-httpd-php-source .phps

AddHandler php-script   php
AddHandler php-script   phps

There are two different levels in the above example. First, the plugin is
"activated" in the LoadModule section. Then, the plugin is "mapped" to a
content type in the AddHandler section. We believe that this is the way to
go. Apache HTTPD is pervasive, and its model is well understood by many of
the same folks who would want to use Nutch. Although we realize that this is
a change from the way that Nutch currently works, and that people don't like
change, we believe that this change is entirely needful and represents
something that Nutch should adopt.
///////////////////////////////////////////////////////////////////////

The above case you mention with respect to the application/pdf documents
happens because in the parse-plugins.xml file there is a mapping of the
parse-pdf plugin to the "application/pdf" mimeType, even though the
parse-pdf plugin isn't activated by default via the plugin.includes property
(note, this is the opposite case of the snippet that I pasted from the
ImprovementProposal off the Wiki above). Therein lies the problem. My idea
was that, similar to the above case in Apache HTTPD, if you map an
"unactivated plugin" to a mimeType via parse-plugins.xml, then really, there
is a configuration error there. I think that this is a LOG.severe()
configuration error because you really need to "activate" a plugin, before
you "map it" to a mime type. For example, why would you want to run a fetch
if you have plugins mapped to mimeTypes via parse-plugins.xml that will
never get called because they have never been activated? Before I run a
fetch, I want to make sure of two important things:

1. I have enabled the entire set of appropriate parse plugins for the
content that I want to fetch

2. I've mapped the enabled parsing plugins to the mimeTypes that they can
deal with (in order of preference)

If I ensure that I do both of these things, then we're fine in the above
case you mention with the PDF files. 
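
The two checks above could be automated before a fetch along the following lines. This is a rough sketch rather than anything in the trunk: ParsePluginConfigCheck and findUnactivated are hypothetical names, the map stands in for the parsed contents of parse-plugins.xml, and plugin.includes is assumed to be a regular expression that must match the whole plugin id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical pre-fetch check: finds plugins that are mapped to a
// mimeType but not activated through plugin.includes.
class ParsePluginConfigCheck {

    /**
     * @param pluginIncludes value of plugin.includes (assumed to be a regex)
     * @param mimeToPlugins  mimeType -> plugin ids in order of preference
     * @return sorted ids of mapped plugins that plugin.includes does not match
     */
    public static List<String> findUnactivated(String pluginIncludes,
                                               Map<String, List<String>> mimeToPlugins) {
        Pattern included = Pattern.compile(pluginIncludes);
        Set<String> missing = new TreeSet<>();
        for (List<String> ids : mimeToPlugins.values()) {
            for (String id : ids) {
                if (!included.matcher(id).matches()) {
                    missing.add(id);
                }
            }
        }
        return new ArrayList<>(missing);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mappings = new LinkedHashMap<>();
        mappings.put("text/html", Arrays.asList("parse-html"));
        mappings.put("application/pdf", Arrays.asList("parse-pdf"));
        // prints "[parse-pdf]"
        System.out.println(findUnactivated("parse-(text|html|js)", mappings));
    }
}
```

An empty result would mean that every mapped plugin is also activated, which is the consistent state described above.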

Now, I know that this is a somewhat different process than what people are
used to with Nutch. Totally understandable. But I think that the
improvements that are reaped in the ParserFactory by doing it this way far
outweigh the inconvenience of ensuring consistency between the
plugin.includes property in nutch-default.xml and the parse-plugins.xml
file. 

Of course, there is another issue. The current code committed in the trunk
causes the fetcher to exit right out of the box for certain content types,
because, as far as I can tell, the only enabled parse plugins out of the box
are:

parse-(text|html|js)

I guess this is really a design issue in Nutch. Is there really any reason
that the rest of the parsing plugins aren't enabled by default? I mean, I
guess you guys want to go with the "smallest set" of parsing plugins that
makes Nutch a functional search engine out of the box, no? If so, then I
understand only having these parsing plugins enabled. But for instance, I
would say that many of the other parsing plugins, being committed to the
trunk and included in existing Nutch releases so far (e.g.,
parse-ext|mp3|mspowerpoint|msword|pdf|rss|rtf) are tested enough to be
enabled by default, right? If the answer to that lies in a requirement
similar to what I mentioned, i.e., you want to go with the "smallest set" of
parse plugins out of the box, then there are two ways to deal with what's in trunk:

1. What you suggested, changing the LOG level to warning, instead of SEVERE,
which alleviates the out-of-the-box functionality problem, but also opens up
a problem where a user will wonder why the PDF content that he tried to
fetch didn't get parsed even though it was mapped correctly in
parse-plugins.xml (but not enabled via plugin.includes).

Or 

2. enabling the committed plugins by default in plugin.includes in
conf/nutch-default.xml in the trunk, or at least by default enabling all the
plugins which are currently listed in parse-plugins.xml in the trunk, which
are: parse-text|msword|pdf|rss|msexcel|mspowerpoint|zip|js|rtf|html|ext
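
For comparison, option 1 might look something like the sketch below, where a mapped but unactivated plugin produces a warning that names the plugin and the lookup moves on to the next preference. The class LenientParserLookup and the method firstActivated are hypothetical, the set of activated ids is assumed to have been derived from plugin.includes beforehand, and java.util.logging is used only because the thread refers to LOG.warning() and LOG.severe().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch of option 1: warn and skip instead of logging SEVERE.
class LenientParserLookup {

    private static final Logger LOG =
            Logger.getLogger(LenientParserLookup.class.getName());

    /**
     * Walks the preference-ordered plugin ids mapped to a mimeType and
     * returns the first activated one, or null if none is activated.
     */
    public static String firstActivated(String mimeType,
                                        List<String> preferred,
                                        Set<String> activated) {
        for (String id : preferred) {
            if (activated.contains(id)) {
                return id;
            }
            // Name the plugin so the user can see why the content was not
            // parsed by the plugin that parse-plugins.xml maps it to.
            LOG.warning("Plugin " + id + " is mapped to " + mimeType
                    + " but is not activated via plugin.includes; skipping it");
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> activated = new HashSet<>(
                Arrays.asList("parse-text", "parse-html", "parse-js"));
        // Logs one warning about parse-pdf, then prints "parse-text".
        System.out.println(firstActivated("application/pdf",
                Arrays.asList("parse-pdf", "parse-text"), activated));
    }
}
```

A warning phrased this way would at least partly address the concern raised under option 1, since the log would say which plugin was mapped but not enabled.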


Of course, it's up to you guys what you want to do, however, that's just my
two cents.


Take care,
   Chris



