Posted to dev@nutch.apache.org by "Mathijs Homminga (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2012/03/12 21:30:42 UTC

[jira] [Issue Comment Edited] (NUTCH-882) Design a Host table in GORA

    [ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859 ] 

Mathijs Homminga edited comment on NUTCH-882 at 3/12/12 8:29 PM:
-----------------------------------------------------------------

Hi guys,

I have second thoughts on implementing the NutchContext concept at this stage.

All Nutch processes are centered around the concept of a WebPage. And I agree, many of these processes and their plugins might benefit from additional input which is related to, but not directly part of, a WebPage, such as host statistics, metadata or domain information.

The proposed NutchContext solution is elegant: it makes this additional information available to plugins in an extensible way.
However, it does require a big API break for plugins. Since we don't use abstract base classes for all the plugins, we can't absorb the change there to keep them compatible.
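
To illustrate the compatibility point, here is a minimal sketch. All type names in it (WebPage, NutchContext, PagePlugin, AbstractPagePlugin, LegacyPlugin) are hypothetical stand-ins for illustration, not the actual Nutch plugin API. It shows why a context parameter breaks implementations of a plain interface, while an abstract base class could have delegated to the old method:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; not the real Nutch classes.
class WebPage {
    final String url;
    WebPage(String url) { this.url = url; }
}

class NutchContext {
    private final Map<String, String> hostMetadata = new HashMap<>();
    void putHostMetadata(String key, String value) { hostMetadata.put(key, value); }
    String getHostMetadata(String key) { return hostMetadata.get(key); }
}

// Extension point as a plain interface: once the signature gains a
// NutchContext parameter, every existing implementation that only
// defines process(WebPage) stops compiling.
interface PagePlugin {
    String process(WebPage page, NutchContext context);
}

// Extension point as an abstract base class: the new method can
// delegate to the old one, so existing subclasses keep working.
abstract class AbstractPagePlugin {
    abstract String process(WebPage page);

    String process(WebPage page, NutchContext context) {
        return process(page); // default: ignore the context
    }
}

// An "old" plugin written before the context existed.
class LegacyPlugin extends AbstractPagePlugin {
    @Override
    String process(WebPage page) {
        return "legacy:" + page.url;
    }
}
```

With a base class like this, callers could switch to the two-argument method without touching LegacyPlugin. Without one, every plugin has to be edited, which is the API break described above.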

I'm afraid that a patch that tries to implement the Host table and the NutchContext at the same time will have a hard time making it into the repository ;)

I propose moving the NutchContext approach to a new issue.
Plugins and other components can still use Host information by calling the HostDB class directly to perform efficient host lookups when needed. We can decide later whether to make this part of the NutchContext.
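
As a rough sketch of what such a direct lookup could look like: the actual HostDB API in the attached patch is not shown in this thread, so InMemoryHostDb, HostInfo, CrawlDelayPlugin and the "fetch.delay.ms" key below are all assumptions made up for illustration, with an in-memory map standing in for the GORA-backed store.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical host record; the real schema may differ.
class HostInfo {
    final String host;
    final Map<String, String> metadata = new HashMap<>();
    HostInfo(String host) { this.host = host; }
}

// Hypothetical stand-in for the HostDB class: an in-memory map
// instead of a GORA-backed table.
class InMemoryHostDb {
    private final Map<String, HostInfo> hosts = new HashMap<>();
    void put(HostInfo info) { hosts.put(info.host, info); }
    HostInfo get(String host) { return hosts.get(host); }
}

// A plugin that looks up host information itself, on demand,
// instead of receiving it through a NutchContext argument.
class CrawlDelayPlugin {
    private final InMemoryHostDb hostDb;

    CrawlDelayPlugin(InMemoryHostDb hostDb) { this.hostDb = hostDb; }

    // Returns the per-host delay in milliseconds, or defaultMs when
    // the host is unknown or has no delay configured.
    long delayFor(String pageUrl, long defaultMs) {
        String host = URI.create(pageUrl).getHost();
        HostInfo info = (host == null) ? null : hostDb.get(host);
        if (info == null) {
            return defaultMs;
        }
        String value = info.metadata.get("fetch.delay.ms"); // assumed key
        return (value == null) ? defaultMs : Long.parseLong(value);
    }
}
```

The plugin signature stays untouched here: only plugins that need host data take a dependency on the lookup class, so no API break is required.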

Agree?





                
> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: nutchgora
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: nutchgora
>
>         Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:
> * customising the behaviour of the fetching on a per-host basis, e.g. number of threads, min time between threads, etc.
> * storing stats
> * keeping metadata and possibly propagating them to the webpages
> * keeping a copy of the robots.txt and possibly using that later to filter the webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table, but any comments are of course already welcome.
