Posted to dev@nutch.apache.org by "Gal Nitzan (JIRA)" <ji...@apache.org> on 2006/01/15 23:40:20 UTC
[jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

     [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]

Gal Nitzan updated NUTCH-179:
-----------------------------

    Description: 
Sometimes there are "real world" requirements (usually from your boss) where a certain site needs special parsing. Even though the content type is text/html, a specialized parser is needed.

Example: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site, I need to look for certain entries in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself, I noticed that it would be much easier for others to implement this if ParserFactory could use NutchConf to check for certain properties and, when they match, select the plugin based on the URL and not just the content type.
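
As a minimal sketch of the selection logic described above (not Nutch's actual ParserFactory or NutchConf API; the class name UrlAwareParserSelector, the rule format, and the plugin id "parse-partner-html" are assumptions made purely for illustration), URL rules could be checked first, with the existing content-type lookup as the fallback:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: choose a parser plugin id by URL pattern first,
 * then fall back to the content type. Not the real Nutch ParserFactory.
 */
public class UrlAwareParserSelector {

    // URL regex -> plugin id, checked in insertion order (first match wins).
    private final Map<Pattern, String> urlRules = new LinkedHashMap<>();

    // Content type -> plugin id, i.e. the current content-type-only behaviour.
    private final Map<String, String> contentTypeRules = new LinkedHashMap<>();

    public void addUrlRule(String urlRegex, String pluginId) {
        urlRules.put(Pattern.compile(urlRegex), pluginId);
    }

    public void addContentTypeRule(String contentType, String pluginId) {
        contentTypeRules.put(contentType, pluginId);
    }

    /** Returns the plugin id to use, or null if no rule matches. */
    public String select(String url, String contentType) {
        for (Map.Entry<Pattern, String> rule : urlRules.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue();
            }
        }
        return contentTypeRules.get(contentType);
    }

    public static void main(String[] args) {
        UrlAwareParserSelector selector = new UrlAwareParserSelector();
        selector.addContentTypeRule("text/html", "parse-html");
        // Assumed partner host; in practice the rules would come from configuration.
        selector.addUrlRule("^https?://([^/]+\\.)?partner\\.example\\.com/",
                            "parse-partner-html");

        // Partner URL: the URL rule overrides the content type.
        System.out.println(selector.select(
            "http://www.partner.example.com/page.html", "text/html"));
        // Any other URL: falls back to the content-type mapping.
        System.out.println(selector.select(
            "http://other.example.org/page.html", "text/html"));
    }
}
```

In this sketch the URL rules are evaluated in the order they were added, so more specific patterns would need to be registered before broader ones. How such rules would actually be expressed as configuration properties is left open here.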

The implementation shouldn't be too complicated.

Looking forward to hearing more ideas.

  was:
Sometimes there are "real world" requirements (usually from your boss) where a certain site needs special parsing.

Example: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site, I need to look for certain entries in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself, I noticed that it would be much easier for others to implement this if ParserFactory could use NutchConf to check for certain properties and, when they match, select the plugin based on the URL and not just the content type.

The implementation shouldn't be too complicated.

Looking forward to hearing more ideas.


> Proposition: Enable Nutch to use a parser plugin not just based on content type
> -------------------------------------------------------------------------------
>
>          Key: NUTCH-179
>          URL: http://issues.apache.org/jira/browse/NUTCH-179
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Gal Nitzan

>
> Sometimes there are "real world" requirements (usually from your boss) where a certain site needs special parsing. Even though the content type is text/html, a specialized parser is needed.
> Example: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site, I need to look for certain entries in the text and boost the score.
> Currently the ParserFactory looks for a plugin based only on the content type.
> Facing this issue myself, I noticed that it would be much easier for others to implement this if ParserFactory could use NutchConf to check for certain properties and, when they match, select the plugin based on the URL and not just the content type.
> The implementation shouldn't be too complicated.
> Looking forward to hearing more ideas.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira