Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2020/06/09 12:11:00 UTC

[jira] [Commented] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly

    [ https://issues.apache.org/jira/browse/NUTCH-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129203#comment-17129203 ] 

Sebastian Nagel commented on NUTCH-2787:
----------------------------------------

Confirmed. Thanks, [~pmezard]! I'll try to provide a fix soon.

> CrawlDb JSON dump does not export metadata primitive data types correctly
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2787
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2787
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.17
>         Environment: Reproduced with:
> {code:java}
> commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
> Merge: e61a8a3b f971ca1b
> Author: Sebastian Nagel <sn...@apache.org>
> Date:   Thu May 14 17:43:14 2020 +0200
>
>     Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence
>
>     NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file {code}
>            Reporter: Patrick Mézard
>            Priority: Minor
>             Fix For: 1.17
>
>
> To reproduce:
>  * Activate the scoring-depth plugin
>  * Create a new crawldb from a seed URL
>  * Dump the crawldb as JSON
>  * Look at the JSON
> {code:java}
> $ nutch inject crawl/crawldb seeds.txt
> $ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
> $ cat out/part-r-00000 | head -1 | python -m json.tool
> {
>     "url": "http://example.com/",
>     "statusCode": 1,
>     "statusName": "db_unfetched",
>     "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
>     "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
>     "retriesSinceFetch": 0,
>     "retryIntervalSeconds": 2592000,
>     "retryIntervalDays": 30,
>     "score": 1.0,
>     "signature": "null",
>     "metadata": {
>         "_depth_": {},
>         "_maxdepth_": {}
>     }
> }{code}
> KO => `_depth_` and `_maxdepth_` are exported as empty objects ({}) instead of integers.
> The fields are correct in the crawldb, as shown by a CSV dump:
> {code:java}
> $ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
> $ cat out/part-r-00000 
> Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
> "http://example.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||" {code}
> Code is here:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269]
> I do not know Java very well but I think it comes from IntWritable & co not being POJO types (or at least not the way we want them).
> One fix might be to:
>  * Map all primitive-type Writable classes to a function that casts from the base interface and calls "get" (maybe boxing the value as well).
>  * Call that in the metadata conversion loop.
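
The fix suggested in the description could be sketched roughly as follows. This is only an illustration, not the code in CrawlDbReader: the class name WritableUnboxSketch, the UNBOXERS map and the toPlain/convert helpers are hypothetical, and the nested Writable, IntWritable, FloatWritable and Text types are minimal stand-ins that mimic the get()/toString() accessors of the Hadoop classes of the same name so that the sketch compiles without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class WritableUnboxSketch {

    // Stand-ins (assumption): in Nutch these would be the org.apache.hadoop.io types.
    public interface Writable {}

    public static final class IntWritable implements Writable {
        private final int value;
        public IntWritable(int value) { this.value = value; }
        public int get() { return value; }
    }

    public static final class FloatWritable implements Writable {
        private final float value;
        public FloatWritable(float value) { this.value = value; }
        public float get() { return value; }
    }

    public static final class Text implements Writable {
        private final String value;
        public Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    // Hypothetical lookup table: Writable class -> function that casts and calls get().
    // The primitive result is boxed automatically (int -> Integer, float -> Float).
    private static final Map<Class<?>, Function<Writable, Object>> UNBOXERS = new LinkedHashMap<>();
    static {
        UNBOXERS.put(IntWritable.class, w -> ((IntWritable) w).get());
        UNBOXERS.put(FloatWritable.class, w -> ((FloatWritable) w).get());
    }

    // Convert one metadata value; anything without a registered unboxer falls back to toString().
    public static Object toPlain(Writable w) {
        Function<Writable, Object> unboxer = UNBOXERS.get(w.getClass());
        return unboxer != null ? unboxer.apply(w) : w.toString();
    }

    // The "metadata conversion loop": build a map that a JSON serializer can handle as plain values.
    public static Map<String, Object> convert(Map<String, ? extends Writable> metadata) {
        Map<String, Object> plain = new LinkedHashMap<>();
        for (Map.Entry<String, ? extends Writable> e : metadata.entrySet()) {
            plain.put(e.getKey(), toPlain(e.getValue()));
        }
        return plain;
    }

    public static void main(String[] args) {
        Map<String, Writable> metadata = new LinkedHashMap<>();
        metadata.put("_depth_", new IntWritable(1));
        metadata.put("_maxdepth_", new IntWritable(5));
        System.out.println(convert(metadata)); // prints {_depth_=1, _maxdepth_=5}
    }
}
```

With a conversion of this kind in front of the JSON writer, the two depth fields would be emitted as numbers rather than as empty objects. The remaining primitive Writable types (long, double, boolean, ...) would need their own entries in the table.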



--
This message was sent by Atlassian Jira
(v8.3.4#803005)