
Posted to common-dev@hadoop.apache.org by "Johan Oskarsson (JIRA)" <ji...@apache.org> on 2008/11/12 14:57:44 UTC

[jira] Created: (HADOOP-4640) Add ability to split text files compressed with lzo

Add ability to split text files compressed with lzo
---------------------------------------------------

                 Key: HADOOP-4640
                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
             Project: Hadoop Core
          Issue Type: Improvement
          Components: io, mapred
            Reporter: Johan Oskarsson
            Priority: Trivial


Right now, any file compressed with lzop is processed by a single mapper. This is a shame, since the lzo algorithm is well suited to large log files and similar common Hadoop data sets. The compression ratio is not the best out there, but decompression is very fast. Since lzo writes compressed data in blocks, it would be possible to write an input format that can split the files.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Fix Version/s: 0.20.0
           Status: Patch Available  (was: Open)

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch
>
>
> Right now, any file compressed with lzop is processed by a single mapper. This is a shame, since the lzo algorithm is well suited to large log files and similar common Hadoop data sets. The compression ratio is not the best out there, but decompression is very fast. Since lzo writes compressed data in blocks, it would be possible to write an input format that can split the files.


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647139#action_12647139 ] 

Chris Douglas commented on HADOOP-4640:
---------------------------------------

Good idea.
* On LzopCodec: Removing the unused bufferSize field is clearly useful. The condition that decompressedWholeBlock guards against is better handled in close() than in verifyChecksum, though... right? It might be better to finish reading the block and verify its checksum rather than skip verification.
* LzopCodec was removed from the default list of codecs, per HADOOP-4030.
* +1 for an OutputFormat
* The size of each block (including checksums) depends on the codecs specified in the header; LzoTextInputFormat::index assumes only one checksum per block, which may not be the case:
{noformat}
+        is.seek(pos + compressedBlockSize + 4); // crc int?
{noformat}
* Each RecordReader doesn't need to slurp and sort the full index. If each FileSplit were guaranteed to point to the beginning of a block, all the splits could be generated by the client using the index.
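
To illustrate the last point: if the client holds the sorted block offsets from the index, it can cut splits only at block boundaries, so no RecordReader has to load the index. The following is a minimal sketch in plain Java and not the code from the patch. The class name BlockAlignedSplits, the compute method, and the targetSize parameter are all hypothetical, and a split is represented as a {start, end} pair rather than a Hadoop FileSplit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns a sorted block index into block-aligned splits. */
final class BlockAlignedSplits {

  /**
   * @param blockOffsets sorted offsets of the first byte of each compressed block
   * @param fileLength   length of the compressed file in bytes
   * @param targetSize   desired split size in bytes (a hint, not a hard limit)
   * @return splits as {start, end} pairs, end exclusive
   */
  static List<long[]> compute(long[] blockOffsets, long fileLength, long targetSize) {
    List<long[]> splits = new ArrayList<>();
    if (blockOffsets.length == 0) {
      // No index entries: fall back to the whole file as one split.
      splits.add(new long[] {0L, fileLength});
      return splits;
    }
    long start = blockOffsets[0];
    for (int i = 1; i < blockOffsets.length; i++) {
      // Cut only at a block boundary, once the current split is large enough.
      if (blockOffsets[i] - start >= targetSize) {
        splits.add(new long[] {start, blockOffsets[i]});
        start = blockOffsets[i];
      }
    }
    // The last split runs to the end of the file.
    splits.add(new long[] {start, fileLength});
    return splits;
  }
}
```

Because every split begins at an indexed offset, a reader can start decompressing at its split start without scanning for a block boundary. The bytes before the first block (the file header) belong to no split in this sketch, so each reader would still need to obtain the header information some other way.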


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649179#action_12649179 ] 

Chris Douglas commented on HADOOP-4640:
---------------------------------------

+1 Patch looks good.

You might want to try running test-patch and the unit tests on your machine; Hudson looks backed up.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Updated patch with most of the suggestions incorporated.
* If the index is missing, processing continues with the whole file as a single split
* Will only skip verifying the checksums in the close method if we haven't decompressed the whole block. That block will be verified by another split later anyway.
* Removed lzop from the codecs list in the config
* The indexer method now accounts for the number of checksum algorithms in use, so it seeks to the next block correctly
* Changed the unit test to write an lzop-compressed file, index it, and read it back again
* As suggested, the RecordReaders no longer read the index; that is done when computing the splits instead

I haven't done any work on an output format; I'd rather leave that for another ticket, since it will require more extensive modifications to the compression classes. The option I'm leaning towards is to register a class that implements an Indexer interface in the stream classes (LzopOutputStream and BlockCompressorStream).

As before, this produces one Findbugs warning.


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650119#action_12650119 ] 

Hadoop QA commented on HADOOP-4640:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12394241/HADOOP-4640.patch
  against trunk revision 719787.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3642/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3642/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3642/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3642/console

This message is automatically generated.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment:     (was: HADOOP-4622.patch)


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649392#action_12649392 ] 

Johan Oskarsson commented on HADOOP-4640:
-----------------------------------------

Local test-patch gives one findbugs error as expected (synchronization). All unit tests pass.
There is a hudson run queued up for the previous version of the patch, not sure how to cancel that.

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]



[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646985#action_12646985 ] 

Doug Cutting commented on HADOOP-4640:
--------------------------------------

> What is our policy on this?

I don't know that we have a clear policy.  In this case, I think it would be fine for the tests to succeed with a warning if native code is not available.  Ideally we should have tests that are only run when native code is available.

A few questions:
 - Should the InputFormat require the index, as in your patch, or rather should it degrade gracefully, so that if indexes do not exist it creates a single split per file?
 - It would be great to have an OutputFormat that creates indexes as files are written.  Is that possible?
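
As a rough illustration of the first question, graceful degradation could start with an index reader that treats a missing index as "no block offsets" instead of failing. This is only a sketch under stated assumptions: the index is assumed to be a separate file holding a flat sequence of 8-byte big-endian block offsets, the class and method names are hypothetical, and it uses java.nio.file on the local filesystem rather than Hadoop's FileSystem API so that it is self-contained.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: reads a block index, tolerating its absence. */
final class IndexOrWholeFile {

  /**
   * Assumed layout: the index file is a flat sequence of 8-byte big-endian
   * offsets, one per compressed block.
   *
   * @return the block offsets, or an empty array if there is no index file
   */
  static long[] readIndexIfPresent(Path indexFile) throws IOException {
    if (!Files.exists(indexFile)) {
      // No index: the caller should treat the file as a single split.
      return new long[0];
    }
    long size = Files.size(indexFile);
    if (size % 8 != 0) {
      throw new IOException("Index length " + size + " is not a multiple of 8");
    }
    // The number of blocks follows directly from the index length.
    long[] offsets = new long[(int) (size / 8)];
    try (DataInputStream in = new DataInputStream(Files.newInputStream(indexFile))) {
      for (int i = 0; i < offsets.length; i++) {
        offsets[i] = in.readLong();
      }
    }
    return offsets;
  }
}
```

A caller that gets an empty array back would create one split covering the whole file, which is the current behaviour for lzop input.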



[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649638#action_12649638 ] 

Johan Oskarsson commented on HADOOP-4640:
-----------------------------------------

I don't think it's worth calling getPos(); as you say, it shouldn't cause any issues the way it is now. TextInputFormat does it the same way.


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649485#action_12649485 ] 

Chris Douglas commented on HADOOP-4640:
---------------------------------------

I hadn't noticed the findbugs warning. I suppose odd/stale/incorrect results from getProgress are mostly benign. Do you think it's worth calling getPos() from getProgress instead of using the pos field? That should resolve the findbugs warning.
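
A minimal sketch of that suggestion, assuming the warning is about the pos field being written under a lock but read without one in getProgress. The class below is a stand-in and not the RecordReader from the patch; advance() is a hypothetical placeholder for the method that reads records.

```java
/** Hypothetical stand-in for a RecordReader that tracks its position. */
final class ProgressSketch {
  private final long start;
  private final long end;
  private long pos; // only touched inside synchronized methods

  ProgressSketch(long start, long end) {
    this.start = start;
    this.end = end;
    this.pos = start;
  }

  /** Placeholder for next(): moves the position forward under the lock. */
  synchronized void advance(long bytes) {
    pos += bytes;
  }

  synchronized long getPos() {
    return pos;
  }

  /** Reads pos through the synchronized accessor instead of the raw field. */
  float getProgress() {
    if (end == start) {
      return 0.0f;
    }
    return Math.min(1.0f, (getPos() - start) / (float) (end - start));
  }
}
```

The cost is one extra lock acquisition per progress call; the benefit is that every access to pos is consistently synchronized.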


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4640:
----------------------------------

    Status: Open  (was: Patch Available)

bq. Will only skip verifying the checksums in the close method if we haven't decompressed the whole block. That block will be verified by another split later anyway.
The data is already decompressed, but it hasn't been read out of the codec's buffer. Adding a new, public method instead of calculating the checksum for the remainder of the buffered block seems like the wrong tradeoff. Something like:
{code}
public void close() throws IOException {
  // Drain whatever is still buffered so the checksum covers the whole block.
  byte[] b = new byte[4096];
  while (!decompressor.finished()) {
    decompressor.decompress(b, 0, b.length);
  }
  super.close();
  // The full block has now passed through the decompressor, so verify it.
  verifyChecksums();
}
{code}
should work, right? Allocating in the close is less optimal than, say, passing the Checksum object to the codec, but this requires fewer changes to the interfaces.

* Using a TreeSet of Long seems unnecessary when the indices are sorted. Since the number of blocks stored in the index can be calculated from its length, a type wrapping a long[] seems more appropriate (the member function on said type can use Arrays::binarySearch instead of TreeSet::ceiling).
* It doesn't need to be part of this patch, but it's worth noting that splittable lzop inputs will create hot spots of the blocks storing the headers. If this were abstracted, then the split could be annotated with the properties of the file and the RecordReader initialized with block properties.
* The count of checksums should include both compressed and decompressed checksums.
* Instead of {{pos + 8}} in createIndex, it would make more sense to record the position in the stream after reading the two ints (so skipping the block uses the more readable {{pos + compressedBlockSize + 4 * numChecksums}}).
* The only termination condition in LzoTextInputFormat::createIndex is uncompressedBlockSize == 0. Values < 0 for uncompressedBlockSize should throw EOFException while values <= 0 for compressedBlockSize should throw IOException.
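
The last three points can be sketched together as an indexing loop. This is an illustration under a simplified, assumed block layout (an int uncompressed size, an int compressed size, then numChecksums 4-byte checksums, then the compressed data) and not the createIndex method from the patch. LzoBlockIndexer, firstBlockPos, and skipFully are hypothetical names, and numChecksums is passed in rather than derived from the file header.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical indexer over a simplified lzop-like block stream. */
final class LzoBlockIndexer {

  /**
   * @param raw           stream positioned at the first block
   * @param firstBlockPos file offset of that first block (just past the header)
   * @param numChecksums  checksums per block, compressed and decompressed combined
   * @return file offsets at which each block starts
   */
  static long[] index(InputStream raw, long firstBlockPos, int numChecksums)
      throws IOException {
    DataInputStream in = new DataInputStream(raw);
    List<Long> offsets = new ArrayList<>();
    long blockStart = firstBlockPos;
    while (true) {
      int uncompressedBlockSize = in.readInt();
      if (uncompressedBlockSize == 0) {
        break; // normal end of the block sequence
      }
      if (uncompressedBlockSize < 0) {
        throw new EOFException("Negative uncompressed block size at " + blockStart);
      }
      int compressedBlockSize = in.readInt();
      if (compressedBlockSize <= 0) {
        throw new IOException("Invalid compressed block size at " + blockStart);
      }
      offsets.add(blockStart);
      long pos = blockStart + 8; // position after reading the two ints
      long toSkip = compressedBlockSize + 4L * numChecksums;
      skipFully(in, toSkip);
      blockStart = pos + toSkip;
    }
    return offsets.stream().mapToLong(Long::longValue).toArray();
  }

  /** Skips exactly n bytes or throws EOFException if the stream ends first. */
  private static void skipFully(InputStream in, long n) throws IOException {
    while (n > 0) {
      long skipped = in.skip(n);
      if (skipped <= 0) {
        // skip() may return 0 before EOF; read one byte to find out.
        if (in.read() < 0) {
          throw new EOFException("Stream ended inside a block");
        }
        skipped = 1;
      }
      n -= skipped;
    }
  }
}
```

Recording pos after the two size fields keeps the skip expression in the readable form suggested above, and the two size checks give the loop a defined failure mode for corrupt or truncated input instead of relying only on the zero-size terminator.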


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Added a unit test for the LzoIndex issue. It seems the other test still passed because the splits were just shifted by one block. The method now works as described.
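
The thread does not show the LzoIndex code or say what the bug was, so the following is only an illustration of the kind of lookup discussed in the review: a type wrapping a sorted long[] that finds the first block at or after a position with Arrays.binarySearch in place of TreeSet::ceiling. The class name LzoIndexSketch and the -1 "not found" convention are assumptions. Lookups like this are easy to get off by one, which would shift every split by a block.

```java
import java.util.Arrays;

/** Hypothetical index type wrapping sorted block offsets. */
final class LzoIndexSketch {
  private final long[] blockOffsets; // sorted ascending

  LzoIndexSketch(long[] blockOffsets) {
    this.blockOffsets = blockOffsets.clone();
  }

  /** Offset of the first block starting at or after pos, or -1 if none. */
  long ceiling(long pos) {
    int i = Arrays.binarySearch(blockOffsets, pos);
    if (i < 0) {
      // Not found: binarySearch returns -(insertionPoint) - 1, and the
      // insertion point is the index of the first offset greater than pos.
      i = -i - 1;
    }
    return i < blockOffsets.length ? blockOffsets[i] : -1L;
  }
}
```

A test that checks positions exactly on a block start, just after one, and past the last block exercises each branch of the lookup.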


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650671#action_12650671 ] 

Hudson commented on HADOOP-4640:
--------------------------------

Integrated in Hadoop-trunk #670 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/670/])
    . Adds an input format that can split lzo compressed text files. (johan)



[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Status: Patch Available  (was: Open)


[jira] Assigned: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson reassigned HADOOP-4640:
---------------------------------------

    Assignee: Johan Oskarsson

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this.
The test failure isn't related to this patch; I created HADOOP-4716 for it.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-4640:
----------------------------------

    Status: Open  (was: Patch Available)

Canceling until Chris' comments are addressed.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Hadoop Flags: [Reviewed]
          Status: Patch Available  (was: Open)

Submitting to hudson now that the queue has disappeared.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Status: Open  (was: Patch Available)

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648887#action_12648887 ] 

Chris Douglas commented on HADOOP-4640:
---------------------------------------

bq. As for the close() I did as suggested, although it rubs me the wrong way to read all those bytes without needing to. I guess the practical performance impact will be minimal though.
It's only calculating a checksum of the remaining bytes from a direct buffer. For the default 64k block, I'd guess it adds somewhere between 20 and 50ms in the close. If it had to make another trip to the native code, I agree that would be improper, but this should be a trivial cost. 
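
To make the cost argument concrete, here is a minimal sketch of draining whatever is left in a direct buffer into a checksum. The class and method names are hypothetical, and Adler32 is only a stand-in: the thread does not say which checksum the patch uses, and this is not the code from the patch.

```java
import java.nio.ByteBuffer;
import java.util.zip.Checksum;

/** Hypothetical illustration; not taken from the attached patch. */
public class RemainingChecksumSketch {

    /**
     * Feeds every byte still remaining in buf into sum and returns the
     * resulting checksum value. The buffer is fully consumed afterwards.
     */
    public static long checksumRemaining(ByteBuffer buf, Checksum sum) {
        byte[] chunk = new byte[4096];              // small heap staging area
        while (buf.hasRemaining()) {
            int n = Math.min(chunk.length, buf.remaining());
            buf.get(chunk, 0, n);                   // copy out of the direct buffer
            sum.update(chunk, 0, n);                // pure Java-side work, no decompression
        }
        return sum.getValue();
    }
}
```

The loop touches each remaining byte once and never calls into the decompressor, which is consistent with the claim that the extra work in close() should be cheap.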

I'm not sure I follow LzoIndex::findIndexPosition. Given {0, 5, 10, 15} as block positions, findIndexPosition(1) will return 10, but findIndexPosition(5) returns 5. Should the former case also return 5? findIndexPosition(11) returns -1, which also seems contrary to its javadoc explanation.
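
The behaviour this comment seems to expect (the first block start at or after the requested position) can be sketched as below. This is an assumption about the intended semantics, not the code from the patch: the class name is hypothetical, the method takes the sorted long[] of block offsets as a parameter, and returning -1 past the last block is a choice made for this sketch.

```java
import java.util.Arrays;

/** Hypothetical illustration; not taken from the attached patch. */
public class LzoIndexSketch {

    /**
     * Returns the first block offset that is >= pos, or -1 when pos lies
     * beyond the last indexed block. blockPositions must be sorted ascending.
     */
    public static long findIndexPosition(long[] blockPositions, long pos) {
        int i = Arrays.binarySearch(blockPositions, pos);
        if (i >= 0) {
            return blockPositions[i];       // pos is exactly a block start
        }
        int insertion = -i - 1;             // index of the first element > pos
        return insertion < blockPositions.length
                ? blockPositions[insertion]
                : -1L;
    }
}
```

With block positions {0, 5, 10, 15}, this sketch returns 5 for positions 1 and 5, and 15 for position 11.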

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Replaced the TreeSet with long[], fixed the incorrect checksum count, fixed the indexer loop termination.
As for the close() I did as suggested, although it rubs me the wrong way to read all those bytes without needing to. I guess the practical performance impact will be minimal though.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Previous file was the patch

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Status: Patch Available  (was: Open)

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646899#action_12646899 ] 

Johan Oskarsson commented on HADOOP-4640:
-----------------------------------------

I've got a working input format that can split lzo files. It requires an index of the file to be created before the lzo file can be split.
In a fairly unscientific experiment I got a 30% performance increase using this compared to reading the uncompressed files on our 20-node cluster.

I need to clean up and test the code a bit before I'll submit it.
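
As a rough illustration of why the index is needed, the sketch below aligns fixed-size splits to block starts so that each mapper begins reading at the start of a compressed block. Everything here is hypothetical (class name, method names, the split-size parameter, and the block offsets used to exercise it); it only shows the general idea and is not the input format from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not taken from the attached patch. */
public class LzoSplitSketch {

    /**
     * Cuts the file into [start, end) ranges whose boundaries fall on block
     * starts taken from the index. blockPositions must be sorted ascending
     * and splitSize must be positive.
     */
    public static List<long[]> alignSplits(long[] blockPositions,
                                           long fileLength,
                                           long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long rawStart = 0; rawStart < fileLength; rawStart += splitSize) {
            long rawEnd = Math.min(rawStart + splitSize, fileLength);
            long start = firstBlockAtOrAfter(blockPositions, rawStart);
            long end = firstBlockAtOrAfter(blockPositions, rawEnd);
            if (end < 0) {
                end = fileLength;           // last split runs to the end of the file
            }
            if (start >= 0 && start < end) {
                splits.add(new long[] {start, end});
            }                               // otherwise no block starts in this range
        }
        return splits;
    }

    /** Linear scan for the first block offset >= pos, or -1 if none exists. */
    static long firstBlockAtOrAfter(long[] sorted, long pos) {
        for (long p : sorted) {
            if (p >= pos) {
                return p;
            }
        }
        return -1L;
    }
}
```

Without an index there is no way to know where a block begins inside an arbitrary byte range, which is why an unindexed file would still have to go to a single mapper.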

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4622.patch

First version of the lzo splittable input format. Please review.

I decided to write a unit test that doesn't require the lzo native libs to be loaded. It's not ideal, but it works.
The other option would be to write one that needs the native libs and is skipped when they aren't available. What is our policy on this?

There will probably be one findbugs error. The same one exists in the LineRecordReader I based this on, and I don't think it will cause any harm.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>         Attachments: HADOOP-4622.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.


[jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647257#action_12647257 ] 

Hadoop QA commented on HADOOP-4640:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12393793/HADOOP-4640.patch
  against trunk revision 713612.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3583/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3583/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3583/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3583/console

This message is automatically generated.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame since the lzo algorithm would be very suitable for large log files and similar common hadoop data sets. The compression rate is not the best out there but the decompression speed is amazing.  Since lzo writes compressed data in blocks it would be possible to make an input format that can split the files.
