Posted to dev@nutch.apache.org by "Riyaz Shaik (JIRA)" <ji...@apache.org> on 2014/07/22 14:57:38 UTC

[jira] [Created] (NUTCH-1822) Page outlinks clearance is not appropriate

Riyaz Shaik created NUTCH-1822:
----------------------------------

             Summary: Page outlinks clearance is not appropriate
                 Key: NUTCH-1822
                 URL: https://issues.apache.org/jira/browse/NUTCH-1822
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.1
         Environment: Nutch-2.1
Hadoop-0.20.205
HBase-0.90.6
hbase-gora-0.2.1
            Reporter: Riyaz Shaik


1. When a page is re-crawled and new outlink URLs are found alongside existing ones, the old outlinks are removed and only the new URLs are written to HBase.
Ex:
Crawl cycle 1 for www.123.com, identified outlinks are
ol --> abc.com
ol --> pqr.com
Crawl cycle 2 of the same www.123.com, the outlinks are
(note that abc.com was removed and xyz.com was added)
ol --> pqr.com
ol --> xyz.com
At the end of crawl cycle 2, HBase has only xyz.com as an outlink
ol --> xyz.com

Expected:
ol --> pqr.com 
ol --> xyz.com 

2. If some outlinks are removed from a page and no new outlinks are added, re-crawling the page does not clear the obsolete outlinks from HBase.

Ex: Cycle 1 crawled page www.test.com, identified outlinks are
ol --> link1
ol --> link2
ol --> link3
Cycle 2, same page (www.test.com) re-crawled, identified outlinks are
(note: only link2 was removed; no new links were added)
ol --> link1
ol --> link3
but at the end of cycle 2, HBase still has all 3 outlinks
In HBase:
ol --> link1
ol --> link2
ol --> link3

Expected:
ol --> link1
ol --> link3
As per the code in ParseUtil.java, it seems to remove the old links and insert only the new links:
if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
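
The expected result in both cases can be sketched as follows. This is only an illustration using plain java.util maps (URL to anchor text). The class OutlinkReconcileSketch and the method reconcile are hypothetical names, not part of Nutch or Gora, and the sketch makes no claim about how the stored map is actually persisted to HBase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Nutch code: shows the outlink set expected
// in storage after a re-crawl, given the stored and newly parsed outlinks.
final class OutlinkReconcileSketch {

    // Returns a new map holding exactly the outlinks found in the latest parse.
    // Neither input map is modified.
    static Map<String, String> reconcile(Map<String, String> stored,
                                         Map<String, String> parsed) {
        Map<String, String> result = new LinkedHashMap<>(stored);
        // Drop outlinks that no longer appear on the page (abc.com, link2).
        result.keySet().retainAll(parsed.keySet());
        // Add newly found outlinks and refresh anchors of surviving ones.
        result.putAll(parsed);
        return result;
    }

    public static void main(String[] args) {
        // Case 1: abc.com removed, xyz.com added, pqr.com unchanged.
        Map<String, String> stored1 = new LinkedHashMap<>();
        stored1.put("abc.com", "");
        stored1.put("pqr.com", "");
        Map<String, String> parsed1 = new LinkedHashMap<>();
        parsed1.put("pqr.com", "");
        parsed1.put("xyz.com", "");
        System.out.println(reconcile(stored1, parsed1).keySet()); // [pqr.com, xyz.com]

        // Case 2: link2 removed, nothing added.
        Map<String, String> stored2 = new LinkedHashMap<>();
        stored2.put("link1", "");
        stored2.put("link2", "");
        stored2.put("link3", "");
        Map<String, String> parsed2 = new LinkedHashMap<>();
        parsed2.put("link1", "");
        parsed2.put("link3", "");
        System.out.println(reconcile(stored2, parsed2).keySet()); // [link1, link3]
    }
}
```

In both cases the stored outlinks end up equal to the outlinks of the latest parse, which is the behavior described under "Expected" above.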

http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
Thanks
Riyaz




