Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2005/04/17 12:06:59 UTC

[jira] Commented: (NUTCH-34) Parsing different content formats

     [ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_62996 ]
     
Andrzej Bialecki  commented on NUTCH-34:
----------------------------------------

Currently there is such a "registry", and it is built and maintained by PluginRepository.

So, it seems to me that the only change required here would be to add attributes to each plugin config file (and plugin interface) which inform all plugin users about the following:

* a boolean, whether the plugin can handle incomplete files or not.

* an int, setting the content size limit.
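A minimal Java sketch of how these two attributes might be used. All names here (ParserLimitsSketch, ParserCapabilities, handlesIncompleteContent, maxContentBytes, bytesToFetch) are hypothetical and not part of the Nutch plugin API. The map is only a stand-in for the registry that PluginRepository maintains, and both the skip-or-truncate policy and the sample limits are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class ParserLimitsSketch {

    /** Hypothetical per-plugin declaration; not an actual Nutch type. */
    static final class ParserCapabilities {
        final boolean handlesIncompleteContent; // can the parser cope with a truncated file?
        final int maxContentBytes;              // content size limit for this parser, in bytes

        ParserCapabilities(boolean handlesIncompleteContent, int maxContentBytes) {
            this.handlesIncompleteContent = handlesIncompleteContent;
            this.maxContentBytes = maxContentBytes;
        }
    }

    /** Stand-in registry, keyed by content type (assumption for this sketch). */
    static final Map<String, ParserCapabilities> REGISTRY = new HashMap<>();

    /**
     * Returns how many bytes to load for a document, or 0 to skip it.
     * Assumed policy: skip when no parser is registered, or when the
     * document exceeds the limit and the parser cannot handle partial input.
     */
    static int bytesToFetch(String contentType, int contentLength) {
        ParserCapabilities caps = REGISTRY.get(contentType);
        if (caps == null) {
            return 0; // no parser plugin for this type
        }
        if (contentLength <= caps.maxContentBytes) {
            return contentLength; // fits within the limit, load everything
        }
        // Too large: truncate only if the parser tolerates incomplete content.
        return caps.handlesIncompleteContent ? caps.maxContentBytes : 0;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        REGISTRY.put("text/html", new ParserCapabilities(true, 65536));
        REGISTRY.put("application/pdf", new ParserCapabilities(false, 1048576));

        System.out.println(bytesToFetch("text/html", 200000));        // prints 65536
        System.out.println(bytesToFetch("application/pdf", 2000000)); // prints 0
        System.out.println(bytesToFetch("image/png", 5000));          // prints 0
    }
}
```

With something along these lines, the global byte limit would become a per-parser decision, which seems to be what the issue asks for.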

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial

>
> At the moment Nutch is set up to filter content via the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in some registry where every plugin could tell the fetcher that:
> 1. this document type is wanted (because the parser plugin is 
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole 
>    content and some not)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira

Re: [jira] Commented: (NUTCH-34) Parsing different content formats

Posted by Jack Tang <hi...@gmail.com>.

Hi Andrzej 

Regarding the second point: I don't think it is a matter of a
"content size limit". In our CMS product, we need to index the content
that starts at "<!--Indexware Content Starts Here-->" and ends at
"<!--Indexware Content Ends Here-->". It is easy to change the HtmlParser ....
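
A minimal Java sketch of the marker-based extraction described above. It is a stand-alone string helper, not the actual HtmlParser change; the class and method names are hypothetical, and falling back to the whole page when a marker is missing is an assumption.

```java
class MarkedContentSketch {

    static final String START = "<!--Indexware Content Starts Here-->";
    static final String END = "<!--Indexware Content Ends Here-->";

    /**
     * Returns the text between the two markers. If either marker is
     * missing, the whole input is returned (assumed fallback).
     */
    static String extractMarked(String html) {
        int start = html.indexOf(START);
        if (start < 0) {
            return html; // no start marker
        }
        int from = start + START.length();
        int end = html.indexOf(END, from); // search only after the start marker
        if (end < 0) {
            return html; // no end marker
        }
        return html.substring(from, end);
    }

    public static void main(String[] args) {
        String page = "<html><body>nav" + START + "<p>Article</p>" + END
                + "footer</body></html>";
        System.out.println(extractMarked(page)); // prints <p>Article</p>
    }
}
```

The point of the sketch is that the useful span can start and end anywhere in the page, so a byte limit counted from the beginning of the file would not express it.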

/Jack

On 4/17/05, Andrzej Bialecki  (JIRA) <ji...@apache.org> wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_62996 ]
> 
> Andrzej Bialecki  commented on NUTCH-34:
> ----------------------------------------
> 
> Currently there is such a "registry", and it is built and maintained by PluginRepository.
> 
> So, it seems to me that the only change required here would be to add attributes to each plugin config file (and plugin interface) which inform all plugin users about the following:
> 
> * a boolean, whether the plugin can handle incomplete files or not.
> 
> * an int, setting the content size limit.
> 