Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2015/10/16 00:08:05 UTC
[jira] [Updated] (NUTCH-2143) GeneratorJob ignores batch id passed as argument
[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2143:
-----------------------------------
Description:
The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored and a generated batch id is used to mark the current batch. Log snippets from a run of bin/crawl:
{noformat}
bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
Fetching :
bin/nutch fetch ... 1444941073-14208 ...
...
QueueFeeder finished: total 0 records. Hit by time limit :0
{noformat}
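A mismatch like the one in the log above can be detected mechanically. The following is a minimal sketch, not part of Nutch: it assumes only the two text patterns visible in the snippet (the -batchId option and the "generated batch id" log line), and the function name find_batch_id_mismatch is hypothetical.

```python
import re

# Log line printed by the generator, as it appears in the snippet above.
GENERATED_RE = re.compile(
    r"GeneratorJob: generated batch id: (\S+) containing (\d+) URLs")
# The -batchId option on the generate command line.
REQUESTED_RE = re.compile(r"-batchId\s+(\S+)")


def find_batch_id_mismatch(log_text):
    """Return (requested, generated) if the two ids differ, else None.

    Hypothetical helper: it only inspects text, it does not call Nutch.
    """
    requested = REQUESTED_RE.search(log_text)
    generated = GENERATED_RE.search(log_text)
    if requested is None or generated is None:
        return None  # not enough information in the log
    if requested.group(1) != generated.group(1):
        return requested.group(1), generated.group(1)
    return None


sample = """bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
"""
mismatch = find_batch_id_mismatch(sample)
```

For the sample log, mismatch holds the requested and the generated id as a pair; once the ids agree, the function returns None.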
The generated URLs are marked with the wrong batch id:
{noformat}
hbase(main):010:0> scan 'test_webpage'
ROW COLUMN+CELL
org.apache.nutch:http/ column=f:bid, timestamp=1444941077080, value=1444941074-858443668
...
org.apache.nutch:http/ column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
{noformat}
As a result, the fetcher finds no records for the requested batch id and does not fetch anything. This problem was reported by Sherban Drulea [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
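Why the fetcher ends up with zero records can be illustrated with a simplified model. This sketch assumes that the fetcher selects only rows whose generate mark (the mk:_gnmrk_ column from the scan above) equals the batch id it was given. That is an assumption made for illustration, not a copy of the Nutch code, and select_for_fetch is a hypothetical name.

```python
def select_for_fetch(rows, batch_id):
    """Return the row keys whose generate mark equals batch_id.

    Simplified, assumed model of the fetcher's selection; the real
    FetcherJob logic is not reproduced here.
    """
    return [key for key, columns in rows.items()
            if columns.get("mk:_gnmrk_") == batch_id]


# The row from the HBase scan above, marked with the generated batch id.
rows = {
    "org.apache.nutch:http/": {
        "f:bid": "1444941074-858443668",
        "mk:_gnmrk_": "1444941074-858443668",
    },
}

# Batch id passed on the command line: nothing matches.
with_requested_id = select_for_fetch(rows, "1444941073-14208")
# Batch id the generator actually used: the row is selected.
with_generated_id = select_for_fetch(rows, "1444941074-858443668")
```

Under this model the requested id selects no rows, which is consistent with the "total 0 records" line in the fetcher log, while the generated id selects the one marked row.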
was:
The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored and a generated batch id is used to mark the current batch. Log snippets from a run of bin/crawl:
{noformat}
bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
Fetching :
bin/nutch fetch ... 1444941073-14208 ...
...
QueueFeeder finished: total 0 records. Hit by time limit :0
{noformat}
The generated URLs are marked with the wrong batch id:
{noformat}
hbase(main):010:0> scan 'test_webpage'
ROW COLUMN+CELL
org.apache.nutch:http/ column=f:bid, timestamp=1444941077080, value=1444941074-858443668
...
org.apache.nutch:http/ column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
{noformat}
and fetcher will not fetch anything. This problem was reported by Sherban Drulea [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html],[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
> GeneratorJob ignores batch id passed as argument
> ------------------------------------------------
>
> Key: NUTCH-2143
> URL: https://issues.apache.org/jira/browse/NUTCH-2143
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.3.1
> Reporter: Sebastian Nagel
> Priority: Blocker
> Fix For: 2.3.1
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)