Posted to dev@nutch.apache.org by "Michael Chan (JIRA)" <ji...@apache.org> on 2009/02/28 18:42:12 UTC
[jira] Created: (NUTCH-707) Generation of multiple segments in
multiple runs returns only 1 segment
Generation of multiple segments in multiple runs returns only 1 segment
-----------------------------------------------------------------------
Key: NUTCH-707
URL: https://issues.apache.org/jira/browse/NUTCH-707
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 0.9.0
Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
Reporter: Michael Chan
Fix For: 0.9.0
To generate multiple segments, generator.update.crawldb is set to true and -topN is defined to be the size of the segments. However, only one segment of size N is generated.
For example, I tried it with a db containing 10,000+ links according to a dump. With generator.update.crawldb set to true and -topN set to 5, only 1 segment of size 5 is produced.
It seems to me the problem is due to incorrect recording of the generation time. Selector.map assigns the generation time to every URL, even though reduce only collects N of them. That is perfectly fine if the generator is run only once and the db isn't updated. However, when the generator is run again within genDelay, all the remaining URLs are excluded. So I suggest that the generation time be assigned in reduce rather than in map.
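The diagnosis above can be illustrated with a small stand-alone simulation. This is only a sketch of the reporter's hypothesis, not Nutch code: the class GenTimeSketch, its methods, the in-memory map standing in for the crawl db, and the GEN_DELAY_MS value are all hypothetical names and assumptions made for the example. Note also that a later comment in this thread attributes the actual bug to CrawlDbUpdater, so the sketch models the described symptom rather than the committed fix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of repeated generate runs; hypothetical, not taken from Nutch. */
final class GenTimeSketch {
    /** Assumed delay window for the sketch (stands in for genDelay), in milliseconds. */
    static final long GEN_DELAY_MS = 7L * 24 * 60 * 60 * 1000;

    /** Builds a toy crawl db: URL -> last generation time (null = never generated). */
    static Map<String, Long> newDb(int size) {
        Map<String, Long> db = new LinkedHashMap<>();
        for (int i = 0; i < size; i++) {
            db.put("http://example.com/" + i, null);
        }
        return db;
    }

    /**
     * Simulates one generate run and returns the URLs selected for the segment.
     * markInMap = true  : every candidate is stamped in the "map" phase (reported behaviour).
     * markInMap = false : only the topN selected URLs are stamped in the "reduce" phase (proposal).
     */
    static List<String> generate(Map<String, Long> db, int topN, long now, boolean markInMap) {
        List<String> candidates = new ArrayList<>();
        // "map" phase: skip URLs that were stamped within the delay window.
        for (Map.Entry<String, Long> e : db.entrySet()) {
            Long last = e.getValue();
            if (last != null && now - last < GEN_DELAY_MS) {
                continue;
            }
            if (markInMap) {
                e.setValue(now); // stamps the URL even if reduce later drops it
            }
            candidates.add(e.getKey());
        }
        // "reduce" phase: keep at most topN candidates.
        int n = Math.min(topN, candidates.size());
        List<String> segment = new ArrayList<>(candidates.subList(0, n));
        if (!markInMap) {
            for (String url : segment) {
                db.put(url, now); // stamps only what actually went into the segment
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        for (boolean markInMap : new boolean[] {true, false}) {
            Map<String, Long> db = newDb(20);
            int first = generate(db, 5, 1_000L, markInMap).size();
            int second = generate(db, 5, 2_000L, markInMap).size();
            System.out.println("markInMap=" + markInMap + " first=" + first + " second=" + second);
        }
    }
}
```

With stamping in the map phase, the second run inside the delay window selects nothing, which matches the "only 1 segment" symptom. With stamping deferred to the reduce phase, the second run can still pick 5 of the remaining URLs.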
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-707) Generation of multiple segments in
multiple runs returns only 1 segment
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic updated NUTCH-707:
-----------------------------------
Fix Version/s: 1.1 (was: 0.9.0)
[jira] Closed: (NUTCH-707) Generation of multiple segments in
multiple runs returns only 1 segment
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-707.
-----------------------------------
Resolution: Fixed
Assignee: Andrzej Bialecki
[jira] Updated: (NUTCH-707) Generation of multiple segments in
multiple runs returns only 1 segment
Posted by "Michael Chan (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Chan updated NUTCH-707:
-------------------------------
Attachment: GeneratorDiff
[jira] Commented: (NUTCH-707) Generation of multiple segments in
multiple runs returns only 1 segment
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763994#action_12763994 ]
Andrzej Bialecki commented on NUTCH-707:
-----------------------------------------
Fixed - the bug was actually present in CrawlDbUpdater. Thanks!
[jira] Commented: (NUTCH-707) Generation of multiple segments in
multiple runs returns only 1 segment
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764293#action_12764293 ]
Hudson commented on NUTCH-707:
------------------------------
Integrated in Nutch-trunk #959 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
Generation of multiple segments in multiple runs returns only 1 segment.