Posted to dev@nutch.apache.org by "Markus Jelsma (Created) (JIRA)" <ji...@apache.org> on 2011/12/29 09:49:30 UTC
[jira] [Created] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Webgraph should remove deleted pages from segment input
-------------------------------------------------------
Key: NUTCH-1239
URL: https://issues.apache.org/jira/browse/NUTCH-1239
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Webgraph's outlink job is currently unable to remove links. It should expand its segment input and be able to remove nodes for pages that no longer exist.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178400#comment-13178400 ]
Hudson commented on NUTCH-1239:
-------------------------------
Integrated in nutch-trunk-maven #88 (See [https://builds.apache.org/job/nutch-trunk-maven/88/])
NUTCH-1239 Webgraph should remove deleted pages from segment input
markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1226406
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178358#comment-13178358 ]
Markus Jelsma commented on NUTCH-1239:
--------------------------------------
I'll commit shortly unless there are objections.
Thanks.
[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177208#comment-13177208 ]
Markus Jelsma commented on NUTCH-1239:
--------------------------------------
Oh, I haven't added the new config option yet. However, I'll do that for all the linkrank settings prior to commit.
[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178614#comment-13178614 ]
Hudson commented on NUTCH-1239:
-------------------------------
Integrated in Nutch-trunk #1713 (See [https://builds.apache.org/job/Nutch-trunk/1713/])
NUTCH-1239 Webgraph should remove deleted pages from segment input
markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1226406
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
[jira] [Resolved] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Posted by "Markus Jelsma (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1239.
----------------------------------
Resolution: Fixed
Committed for 1.5 in rev. 1226406.
[jira] [Updated] (NUTCH-1239) Webgraph should remove deleted pages from segment input
Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1239:
---------------------------------
Attachment: NUTCH-1239-1.5-1.patch
Patch for 1.5. A little review would be appreciated. I added a BooleanWritable(false) for keys that no longer exist, based on their CrawlDatum.status. If the reducer picks up that field along with the LinkDatum objects, it rejects the entire key because the page is gone.
I needed a GenericWritable and, lucky for us, we already have NutchWritable, hence adding those two new classes.
I tested it and it seems to work.
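For readers following along, here is a rough sketch of the marker idea described above. It is not the code from the patch: it uses plain Java collections instead of Hadoop, so Boolean.FALSE stands in for BooleanWritable(false) and a String stands in for a LinkDatum. The class and method names (DeletedKeyFilter, reduceKey, reduceAll) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: plain collections stand in for the grouped reducer input.
final class DeletedKeyFilter {

    // Hypothetical stand-in for the BooleanWritable(false) "page is gone" marker.
    static final Boolean GONE = Boolean.FALSE;

    // Returns the outlinks to keep for one key, or an empty list if the
    // values for that key contain the "gone" marker.
    static List<String> reduceKey(List<Object> values) {
        List<String> outlinks = new ArrayList<>();
        for (Object value : values) {
            if (GONE.equals(value)) {
                return new ArrayList<>(); // reject the entire key
            }
            if (value instanceof String) {
                outlinks.add((String) value); // stand-in for a LinkDatum
            }
        }
        return outlinks;
    }

    // Applies reduceKey to every key; keys with nothing left to emit are omitted.
    static Map<String, List<String>> reduceAll(Map<String, List<Object>> grouped) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> entry : grouped.entrySet()) {
            List<String> kept = reduceKey(entry.getValue());
            if (!kept.isEmpty()) {
                result.put(entry.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> grouped = new LinkedHashMap<>();
        grouped.put("http://a.example/",
            Arrays.<Object>asList("http://b.example/", "http://c.example/"));
        // This page was marked as gone, so its outlinks should be dropped.
        grouped.put("http://gone.example/",
            Arrays.<Object>asList("http://a.example/", GONE));
        System.out.println(reduceAll(grouped));
    }
}
```

In the real job the marker and the link objects have to travel through the same value type, which is where a GenericWritable such as NutchWritable comes in; the List<Object> above only mimics that.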