Posted to dev@nutch.apache.org by "Talat UYARER (JIRA)" <ji...@apache.org> on 2015/04/29 11:33:05 UTC

[jira] [Created] (NUTCH-2003) topN is not work correctly

Talat UYARER created NUTCH-2003:
-----------------------------------

             Summary: topN is not work correctly
                 Key: NUTCH-2003
                 URL: https://issues.apache.org/jira/browse/NUTCH-2003
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.3
            Reporter: Talat UYARER
            Priority: Minor


I want to crawl the top 1000 URLs, ordered by score, from the webpage table. It doesn't work correctly.

When I use the topN parameter, it is divided by the number of map tasks (topN / maptaskcounts = maptasktopN), and every map task generates only maptasktopN URLs. For example, assume I have 25 map tasks and set topN to 1000; maptasktopN is then calculated as 40. As a result, we do not get the 1000 highest-scored URLs overall. Instead we get 1000 URLs made up of the 40 highest-scored URLs from each of the 25 map tasks.
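
To make the difference concrete, here is a minimal sketch in Python. It is a simplified model of the behaviour described above, not Nutch's actual generator code. The function names and the synthetic scores are assumptions chosen for illustration: 25 partitions of 100 URLs each, where later partitions hold higher scores.

```python
import heapq


def per_task_top_n(partitions, top_n):
    """Model of the reported behaviour: each map task keeps only its own
    top (top_n // number_of_tasks) scores, regardless of other tasks."""
    limit = top_n // len(partitions)
    selected = []
    for scores in partitions:
        selected.extend(heapq.nlargest(limit, scores))
    return selected


def global_top_n(partitions, top_n):
    """Expected behaviour: the top_n highest scores across all tasks."""
    all_scores = [score for scores in partitions for score in scores]
    return heapq.nlargest(top_n, all_scores)


# Synthetic data (an assumption): task t holds the scores t*100 .. t*100+99,
# so the highest-scored URLs are concentrated in the last tasks.
partitions = [[task * 100 + i for i in range(100)] for task in range(25)]

per_task = per_task_top_n(partitions, 1000)   # 40 URLs from each of 25 tasks
true_top = global_top_n(partitions, 1000)     # scores 1500 .. 2499

# How many of the truly best 1000 URLs did the per-task split select?
overlap = len(set(per_task) & set(true_top))
print(len(per_task), overlap, min(per_task), min(true_top))
```

With this skewed data, both approaches return 1000 URLs, but the per-task split picks only part of the true top 1000 and includes low-scored URLs from the weaker partitions. When scores are spread evenly across map tasks the two results would be much closer, which may be why the problem is easy to miss.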



