Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2020/06/09 12:11:00 UTC
[jira] [Commented] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly
[ https://issues.apache.org/jira/browse/NUTCH-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129203#comment-17129203 ]
Sebastian Nagel commented on NUTCH-2787:
----------------------------------------
Confirmed. Thanks, [~pmezard]! I'll try to provide a fix soon.
> CrawlDb JSON dump does not export metadata primitive data types correctly
> -------------------------------------------------------------------------
>
> Key: NUTCH-2787
> URL: https://issues.apache.org/jira/browse/NUTCH-2787
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.17
> Environment: Reproduced with:
> {code:java}
> commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
> Merge: e61a8a3b f971ca1b
> Author: Sebastian Nagel <sn...@apache.org>
> Date: Thu May 14 17:43:14 2020 +0200
>
> Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence
>
> NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file {code}
> Reporter: Patrick Mézard
> Priority: Minor
> Fix For: 1.17
>
>
> To reproduce:
> * Activate scoring-depth plugin
> * Create a new crawldb from a seed URL:
> * Dump the crawldb as json
> * Look at the json
> {code:java}
> $ nutch inject crawl/crawldb seeds.txt
> $ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
> $ cat out/part-r-00000 | head -1 | python -m json.tool
> {
> "url": "http://example.com/",
> "statusCode": 1,
> "statusName": "db_unfetched",
> "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
> "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
> "retriesSinceFetch": 0,
> "retryIntervalSeconds": 2592000,
> "retryIntervalDays": 30,
> "score": 1.0,
> "signature": "null",
> "metadata": {
> "_depth_": {},
> "_maxdepth_": {}
> }
> }{code}
> KO => `_depth_` and `_maxdepth_` are not integers.
> The fields are correct in the crawldb, as shown by a CSV dump:
> {code:java}
> $ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
> $ cat out/part-r-00000
> Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
> "http://example.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||" {code}
> Code is here:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269]
> I do not know Java very well, but I think it comes from IntWritable & co. not being POJO types (or at least not in the way the JSON serializer expects).
> One fix might be to:
> * Map all primitive-type Writable classes to a function that casts to the concrete class and calls "get" (boxing the value as well, if needed).
> * Call that function in the metadata conversion loop.
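The unboxing approach proposed above can be sketched as follows. This is a hypothetical illustration, not the actual patch: the `MiniIntWritable` and `MiniFloatWritable` classes below are minimal local stand-ins for `org.apache.hadoop.io.IntWritable` and `FloatWritable` (so the sketch is self-contained); the real fix would dispatch on the Hadoop classes inside CrawlDbReader's metadata conversion loop.

```java
// Sketch of the proposed fix: map primitive Writable wrappers to plain
// Java values before handing metadata to the JSON serializer, so that
// "_depth_": 1 is emitted instead of "_depth_": {}.
public class WritableUnboxSketch {

  // Stand-in for org.apache.hadoop.io.IntWritable (illustration only)
  static class MiniIntWritable {
    private final int value;
    MiniIntWritable(int value) { this.value = value; }
    int get() { return value; }
  }

  // Stand-in for org.apache.hadoop.io.FloatWritable (illustration only)
  static class MiniFloatWritable {
    private final float value;
    MiniFloatWritable(float value) { this.value = value; }
    float get() { return value; }
  }

  // Unbox a metadata value: primitive wrappers become boxed Java
  // primitives; anything else falls back to its string form.
  static Object toJavaValue(Object value) {
    if (value instanceof MiniIntWritable)
      return ((MiniIntWritable) value).get();   // auto-boxed to Integer
    if (value instanceof MiniFloatWritable)
      return ((MiniFloatWritable) value).get(); // auto-boxed to Float
    return value.toString();
  }

  public static void main(String[] args) {
    // The metadata conversion loop would call toJavaValue per entry:
    System.out.println(toJavaValue(new MiniIntWritable(1)));    // prints 1
    System.out.println(toJavaValue(new MiniFloatWritable(5f))); // prints 5.0
  }
}
```

With plain `Integer`/`Float` values in the metadata map, the JSON writer can serialize them as numbers instead of treating each Writable as an opaque object with no readable bean properties.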
--
This message was sent by Atlassian Jira
(v8.3.4#803005)