Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/06/29 14:46:04 UTC

[jira] Created: (NUTCH-506) Nutch should delegate compression to Hadoop

Nutch should delegate compression to Hadoop
-------------------------------------------

                 Key: NUTCH-506
                 URL: https://issues.apache.org/jira/browse/NUTCH-506
             Project: Nutch
          Issue Type: Improvement
            Reporter: Doğacan Güney
             Fix For: 1.0.0


Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.

Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)
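
As a minimal sketch of what "respecting the setting" could look like with the Hadoop API of the time (class and method names below are illustrative, not taken from any patch), a writer would take the configured type instead of hard-coding NONE:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class CompressionTypeSketch {
      public static SequenceFile.Writer openWriter(Configuration conf, Path out)
          throws java.io.IOException {
        FileSystem fs = FileSystem.get(conf);
        // Take NONE, RECORD or BLOCK from io.seqfile.compression.type
        // instead of overriding it.
        CompressionType type = SequenceFile.getCompressionType(conf);
        return SequenceFile.createWriter(fs, conf, out, Text.class, Text.class, type);
      }
    }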

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-506:
--------------------------------

    Attachment: NUTCH-506.patch

New version. I missed ProtocolStatus and ParseStatus. This patch updates them in a backward-compatible way.
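
For reviewers: the usual way to keep a Writable backward compatible is a version byte at the front of the record, with readFields branching on it. A minimal sketch of that pattern (the class name, field, and version numbers here are made up for illustration, not taken from the patch):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableUtils;

    // Hypothetical stand-in for a status record such as ProtocolStatus.
    public class StatusSketch implements Writable {
      private static final byte VERSION = 2;  // assumed numbering
      private String message = "";

      public void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        // Plain string: file-level compression is left to Hadoop.
        Text.writeString(out, message);
      }

      public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        switch (version) {
          case 1:
            // Old on-disk format: the class compressed its own strings.
            message = WritableUtils.readCompressedString(in);
            break;
          case VERSION:
            message = Text.readString(in);
            break;
          default:
            throw new IOException("unknown version: " + version);
        }
      }
    }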

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch, NUTCH-506.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-506:
--------------------------------

    Attachment: compress.patch

This patch changes Content (it is no longer a CompressedWritable) and ParseText (from VersionedWritable(*) to plain Writable). These changes are backwards compatible, so old segments can still be read after this patch.

The patch also changes Content's public API very slightly: the Content.forceInflate method is removed because it is no longer needed.
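
For context, this is the pattern that goes away. A CompressedWritable deflates its fields on write and inflates them lazily, so every accessor has to call ensureInflated() first; Content.forceInflate() existed to expose exactly that step. A stripped-down sketch of the old shape (class and field names are illustrative; the real Content carries much more state):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.CompressedWritable;
    import org.apache.hadoop.io.Text;

    public class OldStyleContent extends CompressedWritable {
      private String url = "";

      public String getUrl() {
        ensureInflated();  // every accessor must inflate first
        return url;
      }

      protected void readFieldsCompressed(DataInput in) throws IOException {
        url = Text.readString(in);
      }

      protected void writeCompressed(DataOutput out) throws IOException {
        Text.writeString(out, url);
      }
    }

Once the class is a plain Writable, readFields fills the fields directly and there is nothing left to force-inflate.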

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512011 ] 

Doğacan Güney commented on NUTCH-506:
-------------------------------------

For some reason, crawl_generate is not compressed, even though crawldb, crawl_parse and crawl_fetch are compressed. 

I tried running "readseg -dump" on an older 2000-URL segment with this patch, and the dump worked without problems.
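
For reference, the invocation looks roughly like this (paths are placeholders):

    bin/nutch readseg -dump <segment_dir> <output_dir>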

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch, NUTCH-506.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-506.
-------------------------------


> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch, NUTCH-506.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513044 ] 

Doğacan Güney commented on NUTCH-506:
-------------------------------------

If there are no objections, I am going to commit this one.

Just to get more comments, here is a break-down of what this patch does:

* Remove all compression code from Nutch. This means no more writeCompressedString or writeCompressedStringArray calls, and no more CompressedWritable subclasses. All changes are done in a backward-compatible manner. Also, after this change Content's version is -1, and new changes should *decrease* that number. See NUTCH-392 for more details.

* Respect the io.seqfile.compression.type setting for all structures except ParseText, which is always compressed as RECORD (see the sketch below). Also, for some reason crawl_generate is not compressed.

Why are we doing this? Because Hadoop can compress these structures for us efficiently, both in space and in time. I have done some tests with different compression settings in NUTCH-392, and BLOCK compression really makes a difference. I think for a large enough crawl, overall space savings will be around 20% to 40%. Note that this is basically free (there may even be a small performance gain) if you are using Hadoop's native libraries.
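
To make the ParseText exception above concrete, here is a sketch of pinning a MapFile writer to RECORD compression (the class name and key/value types are illustrative, not the patch's actual code). RECORD keeps per-entry random access cheap, since a single value can be inflated without decompressing a whole block:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class ParseTextWriterSketch {
      public static MapFile.Writer open(Configuration conf, FileSystem fs,
          String dir) throws IOException {
        // Always RECORD here, regardless of io.seqfile.compression.type.
        return new MapFile.Writer(conf, fs, dir, Text.class, Text.class,
            CompressionType.RECORD);
      }
    }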

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch, NUTCH-506.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-506.
---------------------------------

    Resolution: Fixed
      Assignee: Doğacan Güney

Committed in rev. 556946.

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch, NUTCH-506.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513428 ] 

Hudson commented on NUTCH-506:
------------------------------

Integrated in Nutch-Nightly #153 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/153/])

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch, NUTCH-506.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (NUTCH-506) Nutch should delegate compression to Hadoop

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509090 ] 

Doğacan Güney edited comment on NUTCH-506 at 6/29/07 5:50 AM:
--------------------------------------------------------------

This patch changes Content (it is no longer a CompressedWritable) and ParseText (from VersionedWritable(*) to plain Writable). These changes are backwards compatible, so old segments can still be read after this patch.

The patch also changes Content's public API very slightly: the Content.forceInflate method is removed because it is no longer needed.

(*) I don't understand how VersionedWritable works. AFAICS, there is no easy way to get the version you just read, so it is useless for data versioning.
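
For the curious, VersionedWritable (paraphrasing the Hadoop source of the time) only checks the version byte and throws on a mismatch; the subclass never sees the number that was read, which is exactly the problem described above:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.VersionMismatchException;
    import org.apache.hadoop.io.Writable;

    public abstract class VersionedWritableSketch implements Writable {
      public abstract byte getVersion();

      public void write(DataOutput out) throws IOException {
        out.writeByte(getVersion());
      }

      public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != getVersion())
          // Callers get an exception, never the old version number,
          // so they cannot branch to read a legacy format.
          throw new VersionMismatchException(getVersion(), version);
      }
    }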


 was:
This patch changes Content (it is no longer a CompressedWritable) and ParseText (from VersionedWritable(*) to plain Writable). These changes are backwards compatible, so old segments can still be read after this patch.

The patch also changes Content's public API very slightly: the Content.forceInflate method is removed because it is no longer needed.

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>                 Key: NUTCH-506
>                 URL: https://issues.apache.org/jira/browse/NUTCH-506
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: compress.patch
>
>
> Some data structures within Nutch (such as Content and ParseText) handle their own compression. We should delegate all compression to Hadoop.
> Also, Nutch should respect the io.seqfile.compression.type setting. Currently, even if io.seqfile.compression.type is BLOCK or RECORD, Nutch overrides it for some structures and sets it to NONE. (However, IMO, ParseText should always be compressed as RECORD for performance reasons.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.