Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2011/12/29 09:49:30 UTC

[jira] [Created] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Webgraph should remove deleted pages from segment input
-------------------------------------------------------

                 Key: NUTCH-1239
                 URL: https://issues.apache.org/jira/browse/NUTCH-1239
             Project: Nutch
          Issue Type: Improvement
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma


Webgraph's outlink job is currently unable to remove links. It should expand its segment input and be able to remove nodes for pages that no longer exist.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178400#comment-13178400 ] 

Hudson commented on NUTCH-1239:
-------------------------------

Integrated in nutch-trunk-maven #88 (See [https://builds.apache.org/job/nutch-trunk-maven/88/])
    NUTCH-1239 Webgraph should remove deleted pages from segment input

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1226406
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java

                
> Webgraph should remove deleted pages from segment input
> -------------------------------------------------------
>
>                 Key: NUTCH-1239
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1239
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1239-1.5-1.patch
>
>
> Webgraph's outlink job is currently unable to remove links. It should expand its segment input and be able to remove nodes for pages that no longer exist.


        

[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178358#comment-13178358 ] 

Markus Jelsma commented on NUTCH-1239:
--------------------------------------

I'll commit shortly unless there are objections.
Thanks.

                

        

[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177208#comment-13177208 ] 

Markus Jelsma commented on NUTCH-1239:
--------------------------------------

Oh, I haven't added the new config option yet. However, I'll do that for all the linkrank settings prior to commit.
                

        

[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178614#comment-13178614 ] 

Hudson commented on NUTCH-1239:
-------------------------------

Integrated in Nutch-trunk #1713 (See [https://builds.apache.org/job/Nutch-trunk/1713/])
    NUTCH-1239 Webgraph should remove deleted pages from segment input

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1226406
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java

                

        

[jira] [Resolved] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Posted by "Markus Jelsma (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1239.
----------------------------------

    Resolution: Fixed

Committed for 1.5 in rev. 1226406.
                

        

[jira] [Updated] (NUTCH-1239) Webgraph should remove deleted pages from segment input

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1239:
---------------------------------

    Attachment: NUTCH-1239-1.5-1.patch

Patch for 1.5. A little review would be appreciated. I added a BooleanWritable(false) for keys that no longer exist, based on their CrawlDatum.status. If the reducer picks up that marker along with the LinkDatum objects, it rejects the entire key because the page is gone.
I needed a GenericWritable and, luckily for us, we already have NutchWritable, hence adding those two new classes to it.
I tested it and it seems to work.
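
To make the marker idea concrete, here is a minimal sketch of the reduce-side check in plain Java. It is not the code from the patch: it deliberately avoids the Hadoop and Nutch types, using a plain Boolean.FALSE in place of BooleanWritable(false) and a hypothetical Link class in place of LinkDatum. The class and method names (GoneMarkerSketch, reduce) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; names and types are assumptions, not the patch's API.
class GoneMarkerSketch {

    /** Hypothetical stand-in for LinkDatum: only the outlink's target URL. */
    static final class Link {
        final String toUrl;

        Link(String toUrl) {
            this.toUrl = toUrl;
        }
    }

    /**
     * Reduce step for one key (one source URL). The values mix Link objects
     * with an optional Boolean.FALSE marker, which stands in for the
     * BooleanWritable(false) emitted when the page's status says it is gone.
     * If the marker is present, the whole key is rejected and no outlinks
     * are returned, regardless of where the marker sits among the values.
     */
    static List<String> reduce(List<Object> values) {
        List<String> outlinks = new ArrayList<>();
        for (Object value : values) {
            if (Boolean.FALSE.equals(value)) {
                // Page no longer exists: drop every link collected for this key.
                return Collections.emptyList();
            }
            if (value instanceof Link) {
                outlinks.add(((Link) value).toUrl);
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<Object> livePage = Arrays.asList(
                new Link("http://a.example/"), new Link("http://b.example/"));
        List<Object> gonePage = Arrays.asList(
                new Link("http://a.example/"), Boolean.FALSE);

        System.out.println("live page outlinks: " + reduce(livePage));
        System.out.println("gone page outlinks: " + reduce(gonePage));
    }
}
```

In the real job the values would need a common wrapper type so that both kinds of record can travel under the same key, which is why a GenericWritable such as NutchWritable is needed.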
                