Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/08/03 18:53:53 UTC

[jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

    [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517580 ] 

Andrzej Bialecki  commented on NUTCH-535:
-----------------------------------------

I also noticed (in a job unrelated to Nutch) that Hadoop sometimes reuses the same object instance, so this problem may appear in other places too. We should review all Nutch Writables to make sure they are properly reset to their initial state when readFields is invoked.

Minor nit: instead of creating a new Metadata in readFields, we can invoke metadata.clear(). This way we don't create new Metadata objects and still get an empty metadata instance.
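
To illustrate the reset-on-read idea, here is a minimal sketch in plain Java. It does not use the real Hadoop or Nutch classes: MetaRecord, ReuseSketch, serialize and sizeAfterReuse are hypothetical names invented for this example, and the write/readFields pair only mimics the shape of a Writable. The point is that a reused instance must clear its own state at the top of readFields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class ReuseSketch {

  /** Hypothetical stand-in for a Writable that carries metadata (not Nutch's class). */
  static class MetaRecord {
    private final Map<String, String> metadata = new HashMap<>();

    void put(String key, String value) { metadata.put(key, value); }

    int size() { return metadata.size(); }

    void write(DataOutput out) throws IOException {
      out.writeInt(metadata.size());
      for (Map.Entry<String, String> e : metadata.entrySet()) {
        out.writeUTF(e.getKey());
        out.writeUTF(e.getValue());
      }
    }

    void readFields(DataInput in) throws IOException {
      // Reset first: the caller may hand us an instance that was already used.
      // Without this line, entries from earlier records would pile up.
      metadata.clear();
      int n = in.readInt();
      for (int i = 0; i < n; i++) {
        String key = in.readUTF();
        String value = in.readUTF();
        metadata.put(key, value);
      }
    }
  }

  static byte[] serialize(MetaRecord record) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    record.write(new DataOutputStream(bytes));
    return bytes.toByteArray();
  }

  /** Deserializes two records into ONE reused instance and returns its final size. */
  static int sizeAfterReuse() throws IOException {
    MetaRecord first = new MetaRecord();
    first.put("a", "1");
    first.put("b", "2");
    MetaRecord second = new MetaRecord();
    second.put("c", "3");

    MetaRecord reused = new MetaRecord();
    reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize(first))));
    reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize(second))));
    return reused.size();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(sizeAfterReuse()); // prints 1
  }
}
```

If the metadata.clear() call is removed, the reused instance ends up holding all three entries, which is the same kind of accumulation described in this issue.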

> ParseData's contentMeta accumulates unnecessary values during parse
> -------------------------------------------------------------------
>
>                 Key: NUTCH-535
>                 URL: https://issues.apache.org/jira/browse/NUTCH-535
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-535.patch
>
>
> After NUTCH-506, if you run parse on a segment, parseData's contentMeta accumulates the metadata of every content parsed so far. This is because NUTCH-506 changed the constructor to create a new metadata (before NUTCH-506, a new metadata was created on every call to readFields). It seems Hadoop somehow caches the Content instance, so each new call to Content.readFields during ParseSegment increases the size of the metadata. Because of this, one can end up with a *huge* parse_data directory (something like 10 times larger than the content directory).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.