Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/08/01 16:08:09 UTC

[jira] [Created] (NUTCH-1074) topN is ignored with maxNumSegments

topN is ignored with maxNumSegments
-----------------------------------

                 Key: NUTCH-1074
                 URL: https://issues.apache.org/jira/browse/NUTCH-1074
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.3
            Reporter: Markus Jelsma
             Fix For: 1.4


When generating segments with topN and maxNumSegments, topN is not respected. It looks like the first generated segment contains topN * maxNumSegments URLs; at least, the number of map input records roughly matches.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1074:
------------------------------------

    Assignee: Markus Jelsma


[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Robert Thomson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107369#comment-13107369 ] 

Robert Thomson commented on NUTCH-1074:
---------------------------------------

As far as I can tell, when generate.max.count is set, the Generator.Selector reduce function partitions records so that each segment contains up to the set number of entries per host.  The relative size of the resulting segments will depend on the distribution of hosts in the crawldb.  topN only limits the mean size of the segments.

If generate.max.count is not set, each segment will contain topN records.
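
The effect described above can be reproduced with a small stand-alone model. The sketch below is not the Nutch source: the class name HostCapModel, the method assign and the sample host list are made up for illustration, and it assumes that every host starts in segment 1 and moves on to the next segment once it has contributed maxCount URLs. With topN = 4 and three segments, 12 URLs are selected, and the per-host cap alone decides where they land.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-host segment assignment (hypothetical, not Nutch code). */
class HostCapModel {

  /**
   * hosts: one entry per selected URL, in selection order.
   * Returns the number of URLs placed in each segment (index 0 = segment 1).
   */
  static int[] assign(String[] hosts, int maxCount, int maxNumSegments) {
    int[] segmentSizes = new int[maxNumSegments];
    // per host: [0] = current segment (1-based), [1] = URLs placed in that segment
    Map<String, int[]> hostCounts = new HashMap<>();
    for (String host : hosts) {
      int[] hc = hostCounts.computeIfAbsent(host, h -> new int[] {1, 0});
      hc[1]++;
      if (hc[1] > maxCount) {
        if (hc[0] < maxNumSegments) {
          hc[0]++;   // host is full in this segment: continue in the next one
          hc[1] = 1;
        } else {
          continue;  // host is full in every segment: skip the URL
        }
      }
      segmentSizes[hc[0] - 1]++;
    }
    return segmentSizes;
  }

  public static void main(String[] args) {
    // 12 URLs = topN (4) * maxNumSegments (3), with a skewed host distribution
    String[] hosts = {"a", "a", "a", "a", "a", "a", "b", "b", "b", "c", "d", "e"};
    System.out.println(Arrays.toString(assign(hosts, 2, 3))); // prints [7, 3, 2]
  }
}
```

The segments hold 7, 3 and 2 URLs: the mean equals topN, but no individual segment is bound by it.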


[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112847#comment-13112847 ] 

Markus Jelsma commented on NUTCH-1074:
--------------------------------------

Yes! I overlooked generate.max.count and you're right. Could you attach your patch to the issue, with the flag for approval of inclusion in Nutch set, so we can test it more and include it if all goes well?


[jira] [Resolved] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1074.
----------------------------------

    Resolution: Fixed

Committed for 1.4 in rev 1174689. Thanks Robert for contributing the patch.


[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113890#comment-13113890 ] 

Hudson commented on NUTCH-1074:
-------------------------------

Integrated in Nutch-branch-1.4 #15 (See [https://builds.apache.org/job/Nutch-branch-1.4/15/])
    NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count

markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174689
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Generator.java



[jira] [Updated] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Robert Thomson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Thomson updated NUTCH-1074:
----------------------------------

    Attachment: generator_fix.patch

Patch to make generate.max.count and topN work together


[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098820#comment-13098820 ] 

Markus Jelsma commented on NUTCH-1074:
--------------------------------------

Finally got some numbers to share from a running test:

maxNumSegments = 3
topN = 250,000
Selector reduce output records = 750,000

The above looks fine. The generator selects exactly numSegments * topN records to be consumed by the following numSegments partitioners. Here are the reduce output record counts of the three partitioned segments, in order:

1: 471,428
2: 171,562
3: 107,010

The strange thing is that the number of reduce output records exactly matches the total number of map input records. This is not what I had expected.

What exactly should we expect?
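
A quick way to check the figures above: the three segment sizes should add up to maxNumSegments * topN, and each size can be compared against topN. The sketch below only redoes that arithmetic with the numbers from this comment, read as whole record counts; the class and method names are illustrative.

```java
import java.util.Locale;

/** Re-checks the record counts quoted in the comment (illustrative only). */
class SegmentTotals {

  static long sum(long[] values) {
    long total = 0;
    for (long v : values) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    long topN = 250_000L;
    int maxNumSegments = 3;
    long[] segments = {471_428L, 171_562L, 107_010L};

    System.out.println("selected  = " + topN * maxNumSegments);
    System.out.println("segmented = " + sum(segments));
    for (int i = 0; i < segments.length; i++) {
      // ratio above 1.0 means the segment is larger than topN
      System.out.println(String.format(Locale.ROOT,
          "segment %d = %.2f x topN", i + 1, (double) segments[i] / topN));
    }
  }
}
```

Both totals come out at 750,000, so no records are lost between selection and partitioning; only their distribution over the segments is uneven.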




[jira] [Closed] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1074.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
                

[jira] [Issue Comment Edited] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098820#comment-13098820 ] 

Markus Jelsma edited comment on NUTCH-1074 at 9/7/11 9:54 AM:
--------------------------------------------------------------

Finally got some numbers to share from a running test:

maxNumSegments = 3
topN = 250,000
Selector reduce output records = 750,000

The above looks fine. The generator selects exactly numSegments * topN records to be consumed by the following numSegments partitioners. Here are the reduce output record counts of the three partitioned segments, in order:

1: 471,428
2: 171,562
3: 107,010

The strange thing is that the number of reduce output records exactly matches the total number of map input records. This is not what I had expected. The generator partitions by host and has a limit on the number of hosts per queue, so I would expect each segment to contain slightly fewer records than topN.

What exactly should we expect?




[jira] [Issue Comment Edited] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Robert Thomson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107369#comment-13107369 ] 

Robert Thomson edited comment on NUTCH-1074 at 9/18/11 7:55 AM:
----------------------------------------------------------------

When generate.max.count is set, the Generator.Selector reduce function partitions records so that each segment contains up to the set number of entries per host.  The relative size of the resulting segments will depend on the distribution of hosts in the crawldb.  topN only limits the mean size of the segments.

If generate.max.count is not set, each segment will contain topN records.

Anyway, here's my fix.  When using generate.max.count, each segment will contain up to topN records, with at most generate.max.count from any single host.

{code}
Index: src/java/org/apache/nutch/crawl/Generator.java
===================================================================
--- src/java/org/apache/nutch/crawl/Generator.java      (revision 1172165)
+++ src/java/org/apache/nutch/crawl/Generator.java      (working copy)
@@ -115,6 +115,7 @@
     private long limit;
     private long count;
     private HashMap<String,int[]> hostCounts = new HashMap<String,int[]>();
+    private int segCounts[];
     private int maxCount;
     private boolean byDomain = false;
     private Partitioner<Text,Writable> partitioner = new URLPartitioner();
@@ -155,6 +156,7 @@
       schedule = FetchScheduleFactory.getFetchSchedule(job);
       scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
       maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
+      segCounts = new int[maxNumSegments];
     }
 
     public void close() {}
@@ -269,6 +271,12 @@
           // increment hostCount
           hostCount[1]++;
 
+          // check if topN reached, select next segment if it is
+          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
+            hostCount[0]++;
+            hostCount[1] = 0;
+          }
+
           // reached the limit of allowed URLs per host / domain
           // see if we can put it in the next segment?
           if (hostCount[1] > maxCount) {
@@ -285,7 +293,11 @@
             }
           }
           entry.segnum = new IntWritable(hostCount[0]);
-        } else entry.segnum = new IntWritable(currentsegmentnum);
+          segCounts[hostCount[0]-1]++;
+        } else {
+          entry.segnum = new IntWritable(currentsegmentnum);
+          segCounts[currentsegmentnum-1]++;
+        }
 
         output.collect(key, entry);
{code}
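
For readers who want to see the effect of the patch without running a crawl, the following is a minimal stand-alone sketch of the idea. It is a toy model, not the code in Generator.java: the class SegmentCapModel and the sample hosts are invented, limit stands in for the per-segment topN, and the per-host counter restarts at 1 rather than 0 when a host moves to the next segment, so that the per-host cap stays exact. Like the loop in the patch, it does not re-check the last segment once a host has reached it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy model of segment assignment with a per-segment cap (hypothetical, not Nutch code). */
class SegmentCapModel {

  /**
   * hosts: one entry per selected URL, in selection order.
   * Returns the number of URLs placed in each segment (index 0 = segment 1).
   */
  static int[] assign(String[] hosts, int maxCount, int maxNumSegments, int limit) {
    int[] segCounts = new int[maxNumSegments];
    // per host: [0] = current segment (1-based), [1] = URLs placed in that segment
    Map<String, int[]> hostCounts = new HashMap<>();
    for (String host : hosts) {
      int[] hc = hostCounts.computeIfAbsent(host, h -> new int[] {1, 0});
      hc[1]++;
      // segment already holds `limit` URLs: move this host on to the next one
      while (segCounts[hc[0] - 1] >= limit && hc[0] < maxNumSegments) {
        hc[0]++;
        hc[1] = 1;
      }
      // per-host cap reached: try the next segment, otherwise skip the URL
      if (hc[1] > maxCount) {
        if (hc[0] < maxNumSegments) {
          hc[0]++;
          hc[1] = 1;
        } else {
          continue;
        }
      }
      segCounts[hc[0] - 1]++;
    }
    return segCounts;
  }

  public static void main(String[] args) {
    // 12 URLs = limit (4) * maxNumSegments (3), with a skewed host distribution
    String[] hosts = {"a", "a", "a", "a", "a", "a", "b", "b", "b", "c", "d", "e"};
    System.out.println(Arrays.toString(assign(hosts, 2, 3, 4))); // prints [4, 4, 4]
  }
}
```

With the per-host cap alone, the same input would put 7, 3 and 2 URLs into the three segments; with the additional per-segment check each segment stops at 4.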


[jira] [Issue Comment Edited] (NUTCH-1074) topN is ignored with maxNumSegments

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098820#comment-13098820 ] 

Markus Jelsma edited comment on NUTCH-1074 at 9/7/11 9:56 AM:
--------------------------------------------------------------

Finally got some numbers to share from a running test:

maxNumSegments = 3
topN = 250,000
Selector reduce output records = 750,000

The above looks fine. The generator selects exactly numSegments * topN records to be consumed by the following numSegments partitioners. Here are the reduce output record counts of the three partitioned segments, in order:

1: 471,428
2: 171,562
3: 107,010

The strange thing is that the number of reduce output records exactly matches the total number of map input records. This is not what I had expected. The generator partitions by host and has a limit on the number of hosts per queue, so I would expect each segment to contain slightly fewer records than topN, but certainly not more than topN.

What exactly should we expect?


