Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2023/01/08 19:14:00 UTC

[jira] [Closed] (NUTCH-1822) Page outlinks clearance is not appropriate

     [ https://issues.apache.org/jira/browse/NUTCH-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel closed NUTCH-1822.
----------------------------------

> Page outlinks clearance is not appropriate
> ------------------------------------------
>
>                 Key: NUTCH-1822
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1822
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.1
>         Environment: Nutch-2.1
> Hadoop-0.20.205
> HBase-0.90.6
> hbase-gora-0.2.1
>            Reporter: Riyaz Shaik
>            Priority: Major
>
> 1. When a page is re-crawled and new outlink URLs are found alongside existing ones, the old outlinks are removed and only the new URLs are written to HBase.
> Ex:
> Crawl cycle 1 for www.123.com, identified outlinks are
> ol --> abc.com
> ol --> pqr.com
> Crawl cycle 2 for the same www.123.com, the outlinks are
> (note that abc.com was removed and xyz.com was added)
> ol --> pqr.com
> ol --> xyz.com
> At the end of crawl cycle 2, HBase has only xyz.com as an outlink:
> ol --> xyz.com
> Expected:
> ol --> pqr.com
> ol --> xyz.com
> 2. If some of the outlinks of a page are removed and no new outlinks are added, re-crawling the page does not clear the obsolete/removed outlinks from HBase.
> Ex: Cycle 1 crawled page www.test.com, identified outlinks are
> ol --> link1
> ol --> link2
> ol --> link3
> Cycle 2, same page (www.test.com) re-crawled, identified outlinks are
> (note: only link2 was removed, no new links were added)
> ol --> link1
> ol --> link3
> but at the end of cycle 2, HBase still has all 3 outlinks.
> In HBase:
> ol --> link1
> ol --> link2
> ol --> link3
> Expected:
> ol --> link1
> ol --> link3
> As per the code in ParseUtil.java, it seems to remove the old links and insert only the new links.
> if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
> Thanks
> Riyaz
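
The expected behaviour in both examples above is that the stored outlinks end up equal to the outlinks found by the latest parse. A minimal sketch of that rule follows. It is not Nutch code: the class OutlinkSync, the method replaceOutlinks, and the use of plain Map<String, String> (URL to anchor text) are hypothetical stand-ins for the page's outlink map, chosen only to illustrate the two scenarios.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutlinkSync {

    // Hypothetical helper: make `stored` contain exactly the links in `parsed`.
    public static void replaceOutlinks(Map<String, String> stored,
                                       Map<String, String> parsed) {
        // Drop every stored link the latest parse no longer found.
        stored.keySet().retainAll(parsed.keySet());
        // Add newly found links and refresh anchors of surviving ones.
        stored.putAll(parsed);
    }

    public static void main(String[] args) {
        // Scenario 1: abc.com removed, xyz.com added.
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("abc.com", "");
        stored.put("pqr.com", "");
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("pqr.com", "");
        parsed.put("xyz.com", "");
        replaceOutlinks(stored, parsed);
        System.out.println(stored.keySet()); // [pqr.com, xyz.com]

        // Scenario 2: link2 removed, nothing added.
        stored = new LinkedHashMap<>();
        stored.put("link1", "");
        stored.put("link2", "");
        stored.put("link3", "");
        parsed = new LinkedHashMap<>();
        parsed.put("link1", "");
        parsed.put("link3", "");
        replaceOutlinks(stored, parsed);
        System.out.println(stored.keySet()); // [link1, link3]
    }
}
```

In a column-oriented store, removing a key from the in-memory map is only half of the fix: the deletion also has to reach the backing table. Whether it does in this setup is not something the sketch can show.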



--
This message was sent by Atlassian Jira
(v8.20.10#820010)