Posted to dev@nutch.apache.org by "Michael Chan (JIRA)" <ji...@apache.org> on 2009/02/28 18:42:12 UTC

[jira] Created: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Generation of multiple segments in multiple runs returns only 1 segment
-----------------------------------------------------------------------

                 Key: NUTCH-707
                 URL: https://issues.apache.org/jira/browse/NUTCH-707
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0
         Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
            Reporter: Michael Chan
             Fix For: 0.9.0


To generate multiple segments over multiple runs, generator.update.crawldb is set to true and -topN is set to the desired segment size. However, only one segment of size N is generated.

For example, I tried it with a db that contains 10,000+ links according to its dump. With generator.update.crawldb set to true and -topN set to 5, only 1 segment of size 5 is produced.

It seems to me the problem is due to an incorrect recording of the generation time. Selector.map assigns the generation time to every URL, even though reduce collects only N of them. That is perfectly fine if the generator is run once and the db isn't updated. But if the generator is run again within genDelay, all the remaining URLs are excluded. So I suggest that the generation time be assigned in reduce rather than in map.
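
The hypothesis above can be illustrated with a small, self-contained simulation. This is only a sketch of the reasoning, not Nutch code: the class GenTimeSketch, the methods generate and segmentSizes, the constant GEN_DELAY, and the in-memory map standing in for the crawl db are all hypothetical names made up for the example. It compares stamping the generation time on every candidate (as the report says Selector.map does) with stamping only the URLs that end up in the segment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenTimeSketch {
    // Hypothetical stand-in for genDelay: URLs generated less than this long ago are skipped.
    static final long GEN_DELAY = 1000L;

    /**
     * One simulated generator run. The db maps url -> last generation time (0 = never generated).
     * If stampInMap is true, every candidate is stamped (the behaviour described in the report);
     * otherwise only the URLs selected for the segment are stamped (the suggested change).
     */
    static List<String> generate(Map<String, Long> db, int topN, long now, boolean stampInMap) {
        List<String> candidates = new ArrayList<>();
        // "map" phase: skip URLs generated within GEN_DELAY, collect the rest.
        for (Map.Entry<String, Long> e : db.entrySet()) {
            boolean recentlyGenerated = e.getValue() != 0L && now - e.getValue() < GEN_DELAY;
            if (recentlyGenerated) {
                continue;
            }
            if (stampInMap) {
                e.setValue(now); // stamps the URL even if it will not be selected
            }
            candidates.add(e.getKey());
        }
        // "reduce" phase: keep only the first topN candidates.
        List<String> segment =
            new ArrayList<>(candidates.subList(0, Math.min(topN, candidates.size())));
        if (!stampInMap) {
            for (String url : segment) {
                db.put(url, now); // stamps only the URLs that were actually selected
            }
        }
        return segment;
    }

    /** Runs the generator three times within GEN_DELAY and returns each segment's size. */
    static List<Integer> segmentSizes(int dbSize, int topN, boolean stampInMap) {
        Map<String, Long> db = new LinkedHashMap<>();
        for (int i = 0; i < dbSize; i++) {
            db.put("http://example.com/" + i, 0L);
        }
        List<Integer> sizes = new ArrayList<>();
        for (long now = 100L; now <= 300L; now += 100L) {
            sizes.add(generate(db, topN, now, stampInMap).size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("stamp in map:    " + segmentSizes(20, 5, true));  // [5, 0, 0]
        System.out.println("stamp in reduce: " + segmentSizes(20, 5, false)); // [5, 5, 5]
    }
}
```

In this model, stamping in the map phase leaves nothing for the second and third runs, which matches the reported symptom; stamping in the reduce phase lets each run produce a full segment.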

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-707:
-----------------------------------

    Fix Version/s:     (was: 0.9.0)
                   1.1


[jira] Closed: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-707.
-----------------------------------

    Resolution: Fixed
      Assignee: Andrzej Bialecki 


[jira] Updated: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Posted by "Michael Chan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Chan updated NUTCH-707:
-------------------------------

    Attachment: GeneratorDiff


[jira] Commented: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763994#action_12763994 ] 

Andrzej Bialecki  commented on NUTCH-707:
-----------------------------------------

Fixed - the bug was actually present in CrawlDbUpdater. Thanks!


[jira] Commented: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764293#action_12764293 ] 

Hudson commented on NUTCH-707:
------------------------------

Integrated in Nutch-trunk #959 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
     Generation of multiple segments in multiple runs returns only 1 segment.

