Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/09/12 01:12:22 UTC

[jira] Created: (HADOOP-522) MapFile should support block compression

MapFile should support block compression
----------------------------------------

                 Key: HADOOP-522
                 URL: http://issues.apache.org/jira/browse/HADOOP-522
             Project: Hadoop
          Issue Type: New Feature
          Components: io
            Reporter: Doug Cutting


MapFile is layered on SequenceFile and permits random access to sorted data files (typically reduce output) through a parallel index file.  This is used widely in Nutch (e.g., at search time for displaying cached pages, incoming links, etc.).  Such sorted data should benefit from block compression, but the current MapFile API provides no way to specify block compression.  Also, even if it did, SequenceFile methods such as seek() and getPosition() behave differently under block compression, so MapFile may not work.
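
To make the "sorted data plus parallel index" idea concrete, here is a minimal in-memory sketch. It is not Hadoop's MapFile code: the class name SparseIndexSketch, its methods, and the index interval are all hypothetical, and plain lists stand in for the data and index files. It only illustrates the general technique of indexing every Nth key, binary-searching that sparse index, and then scanning forward from the indexed position.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: an in-memory stand-in for a sorted data file plus a sparse index. */
public class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();            // "data file" keys, in sorted order
    private final List<String> values = new ArrayList<>();          // "data file" values
    private final List<String> indexKeys = new ArrayList<>();       // every Nth key (the "index file")
    private final List<Integer> indexPositions = new ArrayList<>(); // data position of each indexed key
    private final int indexInterval;                                // hypothetical parameter: N

    public SparseIndexSketch(int indexInterval) {
        this.indexInterval = indexInterval;
    }

    /** Appends an entry; keys must arrive in strictly ascending order. */
    public void append(String key, String value) {
        if (!keys.isEmpty() && keys.get(keys.size() - 1).compareTo(key) >= 0) {
            throw new IllegalArgumentException("keys out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) {
            indexKeys.add(key);
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    /** Returns the value stored for key, or null if the key is absent. */
    public String get(String key) {
        // Binary-search the sparse index for the last indexed key <= key.
        int lo = 0, hi = indexKeys.size() - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return null; // key sorts before the first entry, or there is no data
        }
        // "Seek" to the indexed position, then scan forward through the sorted data.
        for (int pos = indexPositions.get(slot); pos < keys.size(); pos++) {
            int c = keys.get(pos).compareTo(key);
            if (c == 0) {
                return values.get(pos);
            }
            if (c > 0) {
                break; // passed the place where the key would be
            }
        }
        return null;
    }
}
```

The lookup depends on each index entry recording a position that the reader can later seek back to, which is why the meaning of seek() and getPosition() matters to anything layered on top.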

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Resolved: (HADOOP-522) MapFile should support block compression

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-522?page=all ]

Doug Cutting resolved HADOOP-522.
---------------------------------

    Fix Version/s: 0.7.0
       Resolution: Fixed

I just committed this.

[jira] Updated: (HADOOP-522) MapFile should support block compression

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-522?page=all ]

Doug Cutting updated HADOOP-522:
--------------------------------

    Attachment: block-compress-map-file.patch

This is a quick hack to test whether MapFile and SetFile work with block compression.  It currently fails.

To illustrate the problem:

ant compile-core-test
bin/hadoop org.apache.hadoop.io.TestSetFile -local foo


[jira] Updated: (HADOOP-522) MapFile should support block compression

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-522?page=all ]

Doug Cutting updated HADOOP-522:
--------------------------------

    Attachment: block-compress-map-file.patch

Here's a candidate patch.  It:

1. Adds MapFile and SetFile constructors to specify compression type.
2. Fixes MapFile to work correctly with block compression.
3. Fixes SequenceFile to permit random accesses.
4. Cleans up some awkward code in SequenceFile#Reader#readBuffer().
5. Adds some javadoc clarifying a few API assumptions.
6. Extends the SetFile unit test to use block compression.
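
As background for why block compression needs care in an indexed file, the following is a rough sketch and not the SequenceFile format or the code in this patch. The class BlockPositionSketch, its length-prefixed block layout, and the recordsPerBlock parameter are assumptions made purely for illustration. It shows the general consequence of compressing records in groups: every record in a block shares one seekable position (the block start), so a reader must seek to the block, decompress it, and scan within it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Illustrative only: records are buffered and compressed one block at a time. */
public class BlockPositionSketch {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();    // stand-in for the data file
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream(); // uncompressed current block
    private final DataOutputStream pendingOut = new DataOutputStream(pending);
    private final int recordsPerBlock; // hypothetical parameter: records per compressed block
    private int pendingRecords = 0;

    public BlockPositionSketch(int recordsPerBlock) {
        this.recordsPerBlock = recordsPerBlock;
    }

    /** Appends a record and returns the file offset of the block that will hold it. */
    public long append(String key, String value) throws IOException {
        long blockStart = file.size(); // the only position a reader can later seek to
        pendingOut.writeUTF(key);
        pendingOut.writeUTF(value);
        if (++pendingRecords == recordsPerBlock) {
            flush();
        }
        return blockStart;
    }

    /** Compresses the pending records and writes them as one length-prefixed block. */
    public void flush() throws IOException {
        if (pendingRecords == 0) {
            return;
        }
        pendingOut.flush();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(compressed)) {
            pending.writeTo(z);
        }
        DataOutputStream out = new DataOutputStream(file);
        out.writeInt(pendingRecords);    // record count in this block
        out.writeInt(compressed.size()); // compressed length in bytes
        compressed.writeTo(out);
        out.flush();
        pending.reset();
        pendingRecords = 0;
    }

    /** Seeks to a block start, decompresses that block, and scans it for key. */
    public String get(long blockStart, String key) throws IOException {
        byte[] bytes = file.toByteArray();
        int offset = (int) blockStart;
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes, offset, bytes.length - offset));
        int count = in.readInt();
        byte[] block = new byte[in.readInt()];
        in.readFully(block);
        DataInputStream records = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(block)));
        for (int i = 0; i < count; i++) {
            String k = records.readUTF();
            String v = records.readUTF();
            if (k.equals(key)) {
                return v;
            }
        }
        return null; // key is not in this block
    }
}
```

In this sketch an index built from the values returned by append() would hold the same position for every record in a block, so a lookup has to tolerate landing at the start of the block and scanning forward to the wanted key.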

Can someone familiar with SequenceFile please review this?  Thanks!


[jira] Assigned: (HADOOP-522) MapFile should support block compression

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-522?page=all ]

Doug Cutting reassigned HADOOP-522:
-----------------------------------

    Assignee: Doug Cutting

With the patch for HADOOP-532, this now seems to work.  The attached code should probably still be cleaned up a bit, but the basic approach appears sound.

[jira] Updated: (HADOOP-522) MapFile should support block compression

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-522?page=all ]

Doug Cutting updated HADOOP-522:
--------------------------------

    Attachment:     (was: block-compress-map-file.patch)

[jira] Commented: (HADOOP-522) MapFile should support block compression

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-522?page=comments#action_12439382 ] 
            
Arun C Murthy commented on HADOOP-522:
--------------------------------------

The changes to SequenceFile look good to me; the 'seek' is definitely better than reading those blocks (originally my bad).

Super-minor nit: There seems to be a small oversight in the javadoc for Writer#getLength, thanks.

[jira] Commented: (HADOOP-522) MapFile should support block compression

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-522?page=comments#action_12439621 ] 
            
Owen O'Malley commented on HADOOP-522:
--------------------------------------

It looks good other than Arun's nit about the getLength javadoc being for the wrong method. 
