Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/08/03 14:50:52 UTC

[jira] Created: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

ParseData's contentMeta accumulates unnecessary values during parse
-------------------------------------------------------------------

                 Key: NUTCH-535
                 URL: https://issues.apache.org/jira/browse/NUTCH-535
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: Doğacan Güney
            Assignee: Doğacan Güney
             Fix For: 1.0.0


After NUTCH-506, if you run parse on a segment, ParseData's contentMeta accumulates the metadata of every Content parsed so far. This is because NUTCH-506 changed the constructor to create the metadata once (before NUTCH-506, new metadata was created on every call to readFields). It seems Hadoop somehow reuses the Content instance, so each new call to Content.readFields during ParseSegment increases the size of the metadata. Because of this, one can end up with a *huge* parse_data directory (something like 10 times larger than the content directory).
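
To make the accumulation concrete, here is a minimal, self-contained sketch of the reuse pattern in plain Java (no Hadoop dependency; Record stands in for Content and its map for contentMeta, so every name below is illustrative rather than actual Nutch source):

    import java.io.*;
    import java.util.*;

    public class ReuseBug {
      static class Record {
        // Allocated once in the constructor, as after NUTCH-506.
        Map<String, String> meta = new LinkedHashMap<String, String>();

        void write(DataOutput out) throws IOException {
          out.writeInt(meta.size());
          for (Map.Entry<String, String> e : meta.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
          }
        }

        void readFields(DataInput in) throws IOException {
          // No reset here: entries from the previous record survive
          // whenever the caller reuses this instance.
          int n = in.readInt();
          for (int i = 0; i < n; i++)
            meta.put(in.readUTF(), in.readUTF());
        }
      }

      public static void main(String[] args) throws IOException {
        // Serialize two one-entry records back to back.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (String key : new String[] { "a", "b" }) {
          Record r = new Record();
          r.meta.put(key, "v");
          r.write(out);
        }

        // Deserialize both into ONE reused instance, the way Hadoop
        // reuses Writables between records.
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray()));
        Record reused = new Record();
        reused.readFields(in);
        System.out.println(reused.meta); // {a=v}
        reused.readFields(in);
        System.out.println(reused.meta); // {a=v, b=v} -- accumulated
      }
    }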




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-535.
-------------------------------


Resolved and committed.


[jira] Resolved: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-535.
---------------------------------

    Resolution: Fixed

Fixed in rev. 563777.


[jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517580 ] 

Andrzej Bialecki  commented on NUTCH-535:
-----------------------------------------

I also noticed (in a job unrelated to Nutch) that Hadoop sometimes re-uses the same object instance, so this problem may appear in other places too. We should review all Nutch Writables to make sure they are properly reset to their initial state when readFields is invoked.

Minor nit: instead of creating a new Metadata in readFields we can invoke metadata.clear(). This way we don't create new Metadata objects, and still get an empty metadata instance.
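
In sketch form, the suggested pattern (illustrative only, with Metadata approximated by a plain Map; not the committed code):

    import java.io.DataInput;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    class ClearSketch {
      private final Map<String, String> metadata =
          new HashMap<String, String>();

      public void readFields(DataInput in) throws IOException {
        // Instead of "metadata = new Metadata()" (a fresh allocation
        // for every record), reset the existing instance in place:
        metadata.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++)
          metadata.put(in.readUTF(), in.readUTF());
      }
    }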


[jira] Updated: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-535:
--------------------------------

    Attachment: NUTCH_535_v2.patch


[jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517836 ] 

Doğacan Güney commented on NUTCH-535:
-------------------------------------

> I also noticed (in a job unrelated to Nutch) that Hadoop sometimes re-uses the same object instance, so this problem
> may appear in other places too. We should review all Nutch Writables to make sure they are properly reset to their
> initial state when readFields is invoked.

I skimmed through a couple of classes and didn't see any problems, but I saw a few places where we can use this caching behaviour to our advantage (for example, in CrawlDatum.readFields we can eliminate the metaData != null check).

> Minor nit: instead of creating a new Metadata in readFields we can invoke metadata.clear(). This way we don't create
> new Metadata objects, and still get an empty metadata instance.

I will send an updated patch that adds a clear method to Metadata and uses it in Content and ParseData. I will also update CrawlDatum.readFields to clear its MapWritable instead of creating a new one.
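
Roughly the shape that change would take (a sketch of the stated intent, with field names assumed; not the actual NUTCH_535_v2.patch):

    // In Metadata: expose a reset that empties the backing store
    // (assumed here to be a Map) without replacing the instance.
    public void clear() {
      metadata.clear();
    }

    // In Content.readFields and ParseData.readFields: reset before
    // reading this record's own entries.
    metadata.clear();

    // In CrawlDatum.readFields: clear the existing MapWritable rather
    // than allocating a new one; with Hadoop reusing the instance, the
    // "metaData != null" guard mentioned above also becomes redundant.
    metaData.clear();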


[jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518629 ] 

Hudson commented on NUTCH-535:
------------------------------

Integrated in Nutch-Nightly #175 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/175/])


[jira] Updated: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-535:
--------------------------------

    Attachment: NUTCH-535.patch

Patch for the problem. Moves metadata creation from the Content constructor (Content.Content) to Content.readFields.
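
In outline, the change is along these lines (a sketch of the described move, not the patch text itself):

    public Content() {
      // was: metadata = new Metadata();
      // Allocated here, the object lived across readFields() calls on
      // a reused instance and kept accumulating entries.
    }

    public void readFields(DataInput in) throws IOException {
      metadata = new Metadata();  // fresh, empty metadata per record
      // ... read the remaining fields as before ...
    }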
