Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2019/11/22 15:46:01 UTC

[jira] [Resolved] (NUTCH-2003) topN is not work correctly

     [ https://issues.apache.org/jira/browse/NUTCH-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2003.
------------------------------------
    Fix Version/s: 2.5
       Resolution: Auto Closed

Closing 2.5 issues as branch is no longer maintained.

> topN is not work correctly
> --------------------------
>
>                 Key: NUTCH-2003
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2003
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Talat Uyarer
>            Priority: Minor
>             Fix For: 2.5
>
>
> I want to crawl the top 1000 URLs, ordered by score, from the webpage table. This does not work correctly.
> When I use the topN parameter, it is divided by the number of map tasks (topN / maptaskcounts = maptasktopN), and every map task generates maptasktopN URLs. For example, with 25 map tasks and topN set to 1000, maptasktopN is calculated as 40. As a result, we do not get the 1000 highest-scored URLs overall; we get 1000 URLs made up of the 40 highest-scored URLs from each of the 25 map tasks.
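
The difference described in the report can be illustrated with a small simulation. This is only a sketch of the arithmetic, not Nutch code: the function names, the example.org URLs, and the skewed score distribution are all assumptions made for illustration. It compares taking topN / maptaskcounts entries from each partition with taking the topN highest-scored entries across all partitions.

```python
import heapq
import random


def build_partitions(num_tasks=25, urls_per_task=200, seed=0):
    """Build hypothetical (url, score) lists, one per map task.

    Scores are deliberately skewed so that some tasks hold mostly
    low-scored URLs; this is an assumption, not data from the report.
    """
    rng = random.Random(seed)
    partitions = []
    for task in range(num_tasks):
        ceiling = (task + 1) / num_tasks  # later tasks can reach higher scores
        partitions.append(
            [(f"http://example.org/{task}/{i}", rng.uniform(0.0, ceiling))
             for i in range(urls_per_task)]
        )
    return partitions


def per_task_top_n(partitions, top_n):
    """Behaviour described in the report: each task keeps topN // num_tasks."""
    per_task = top_n // len(partitions)  # 1000 // 25 == 40
    selected = []
    for part in partitions:
        selected.extend(heapq.nlargest(per_task, part, key=lambda e: e[1]))
    return selected


def global_top_n(partitions, top_n):
    """Behaviour the reporter expected: the topN best scores overall."""
    merged = [entry for part in partitions for entry in part]
    return heapq.nlargest(top_n, merged, key=lambda e: e[1])


if __name__ == "__main__":
    parts = build_partitions()
    split = per_task_top_n(parts, 1000)
    best = global_top_n(parts, 1000)
    overlap = len({u for u, _ in split} & {u for u, _ in best})
    print("URLs selected per task:", 1000 // len(parts))
    print("URLs shared by both selections:", overlap, "of", len(best))
```

Both selections contain 1000 URLs, but the per-task split is forced to include the best 40 URLs of every task, even a task whose URLs all score low, so its total score can never exceed that of the global selection.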



--
This message was sent by Atlassian Jira
(v8.3.4#803005)