Posted to dev@nutch.apache.org by "behnam nikbakht (Created) (JIRA)" <ji...@apache.org> on 2012/02/19 11:16:34 UTC

[jira] [Created] (NUTCH-1282) linkdb scalability

linkdb scalability
------------------

                 Key: NUTCH-1282
                 URL: https://issues.apache.org/jira/browse/NUTCH-1282
             Project: Nutch
          Issue Type: Improvement
          Components: linkdb
    Affects Versions: 1.4
            Reporter: behnam nikbakht


As described in NUTCH-1054, the linkdb is optional in solrindex; it is used only for anchors and has no impact on scoring.
It appears that the linkdb grows very fast during incremental crawls, which makes it unscalable for very large web sites.
So there are two choices: first, drop invertlinks and the linkdb from the crawl; second, make the linkdb scalable.
invertlinks runs two jobs: the first constructs a new linkdb from the newly parsed segments, and the second merges the new linkdb with the old linkdb. The second job is the one that does not scale, and we can skip it with this change in solrindex:
in the reduce method of the class IndexerMapReduce, if fetchDatum == null or dbDatum == null or parseText == null or parseData == null, then add the anchors to the doc and update Solr (no insert).
Some changes to NutchDocument are also required.
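
The proposed check in the reduce method could be sketched roughly as follows. This is only an illustration in plain Java, not Nutch code: the class AnchorUpdateSketch, the Action enum, the decide method, and the SKIP case for URLs without anchors are all assumptions made for the example, and the real IndexerMapReduce types are replaced by plain Objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the decision made per URL in the reduce step.
class AnchorUpdateSketch {

    // Hypothetical outcomes; none of these names exist in Nutch.
    enum Action { FULL_INSERT, ANCHOR_ONLY_UPDATE, SKIP }

    static Action decide(Object fetchDatum, Object dbDatum,
                         Object parseText, Object parseData,
                         List<String> anchors) {
        // Same condition as in the proposal: any missing part means the
        // page itself cannot be indexed from this segment.
        boolean incomplete = fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null;
        if (!incomplete) {
            return Action.FULL_INSERT;        // normal path: index the whole document
        }
        if (anchors == null || anchors.isEmpty()) {
            return Action.SKIP;               // assumption: nothing to update without anchors
        }
        return Action.ANCHOR_ONLY_UPDATE;     // proposal: update anchors in Solr, no insert
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Apache Nutch");
        System.out.println(decide(null, "db", "text", "data", anchors));
        System.out.println(decide("fetch", "db", "text", "data", anchors));
    }
}
```

With a check like this, a URL that only receives new inlinks in the current segment would get its anchor field updated without the new linkdb having to be merged into the old one.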


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1282) linkdb scalability

Posted by "behnam nikbakht (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221580#comment-13221580 ] 

behnam nikbakht commented on NUTCH-1282:
----------------------------------------

Another option applies when we construct the web graph to implement advanced scoring methods: we can then extract the anchors from the inlinks in the web graph, with no need for the linkdb.
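
As a rough sketch of that idea, anchors can be collected by grouping inlink edges on their target URL. The Edge class and the anchorsByTarget method below are hypothetical and only model an inlink as (from, to, anchor); they are not the web graph classes in Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of inlinks in a web graph, for illustration only.
class InlinkAnchorSketch {

    static final class Edge {
        final String from;
        final String to;
        final String anchor;

        Edge(String from, String to, String anchor) {
            this.from = from;
            this.to = to;
            this.anchor = anchor;
        }
    }

    // Groups the non-empty anchor texts of all inlinks by target URL.
    static Map<String, List<String>> anchorsByTarget(List<Edge> edges) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Edge e : edges) {
            if (e.anchor == null || e.anchor.trim().isEmpty()) {
                continue; // an inlink without anchor text adds nothing to the anchor field
            }
            result.computeIfAbsent(e.to, k -> new ArrayList<>()).add(e.anchor.trim());
        }
        return result;
    }
}
```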
                

        

[jira] [Commented] (NUTCH-1282) linkdb scalability

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221598#comment-13221598 ] 

Markus Jelsma commented on NUTCH-1282:
--------------------------------------

There is an issue for that. In my opinion, once that issue is implemented the current linkdb can be deprecated. Please check NUTCH-1181 if you have a patch for this.
                