You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-issues@hadoop.apache.org by "Aaron Kimball (JIRA)" <ji...@apache.org> on 2009/09/22 02:31:15 UTC

[jira] Created: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Compression and output splitting for Sqoop
------------------------------------------

                 Key: MAPREDUCE-1017
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Components: contrib/sqoop
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball


Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765908#action_12765908 ] 

Aaron Kimball commented on MAPREDUCE-1017:
------------------------------------------

Hudson-failing test passes locally; should be unrelated (Sqoop is a contrib, after all). 

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Open  (was: Patch Available)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Open  (was: Patch Available)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Patch Available  (was: Open)

Recycling; hudson seems to have dropped this from the queue.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Patch Available  (was: Open)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Patch Available  (was: Open)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766013#action_12766013 ] 

Tom White commented on MAPREDUCE-1017:
--------------------------------------

Overall this looks like a good addition. A few comments:
* HdfsSplitOutputStream doesn't seem to be HDFS-specific, so should be renamed (to BlockSplitOutputStream?). In principle it could be used with another block-based filesystem, like S3FileSystem.
* HdfsSplitOutputStream uses CountingOutputStream to keep track of how many bytes have been written. Could you use FSDataOutputStream#getPos() for this? (Also, there's a CountingOutputStream in Apache Commons IO, which we already depend on.)
* It would be good to support more than just gzip compression in HdfsSplitOutputStream. The machinery in org.apache.hadoop.io.compress should make this relatively straightforward.
* It would be good to have a unit test for HdfsSplitOutputStream.
* Formatting nits: there are a few redundant imports, and some lines are greater than 80 characters.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Open  (was: Patch Available)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766306#action_12766306 ] 

Hadoop QA commented on MAPREDUCE-1017:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12422269/MAPREDUCE-1017.3.patch
  against trunk revision 825469.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 8 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/173/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/173/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/173/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/173/console

This message is automatically generated.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767612#action_12767612 ] 

Hadoop QA commented on MAPREDUCE-1017:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12422608/MAPREDUCE-1017.4.patch
  against trunk revision 826767.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 8 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/188/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/188/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/188/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/188/console

This message is automatically generated.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766174#action_12766174 ] 

Aaron Kimball commented on MAPREDUCE-1017:
------------------------------------------

Thanks for the review. Some responses:

* You're correct about the name of HdfsSplitOutputStream. Will change.
* Thanks for the pointer about existing counting streams.
* I agree that this should eventually support multiple compression codecs, but the burden is on the application to select the correct codec based on the intended file extension. That would add even more code to this ticket; I'll move toward support for additional codecs (bz2, etc.) in a subsequent JIRA.
* Do you think a separate test is necessary for HdfsSplitOutputStream? This is tested through TestSplittableBufferedWriter. The SplittableBufferedWriter and SplitOutputStream classes are pretty tightly coupled -- SplittableBufferedWriter does virtually nothing but wrap the OutputStream in a BufferedWriter. {{testSplittingTextFile()}}, for example, only passes because of {{HdfsSplitOutputStream.openNextFile()}}.
* I'll clean up the formatting a bit in the next patch.

Will submit a new patch soon.


> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765874#action_12765874 ] 

Hadoop QA commented on MAPREDUCE-1017:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12422162/MAPREDUCE-1017.2.patch
  against trunk revision 825083.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 8 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/170/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/170/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/170/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/170/console

This message is automatically generated.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765353#action_12765353 ] 

Hadoop QA commented on MAPREDUCE-1017:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12420242/MAPREDUCE-1017.patch
  against trunk revision 824750.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 8 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/70/console

This message is automatically generated.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Tom White (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-1017:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.22.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I've just committed this. Thanks Aaron!

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758087#action_12758087 ] 

Aaron Kimball commented on MAPREDUCE-1017:
------------------------------------------

In addition to the unit tests added in the {{org.apache.hadoop.sqoop.io}} package, I also performed a larger-scale test of this functionality. A 1.5 GB table was imported from MySQL to HDFS; the data was highly redundant, so compression shrank the files considerably, as well improved as the import time. The arguments {{\-z \-\-direct-split-size 25000000}} was given, so that it would generate approximately 25 MB files. This worked, and three files were generated. I verified using {{head}} and {{tail}} that the files did not lose any records and that records did not span multiple files.

I also verified that {{\-\-direct-split-size}} worked without compression, which it does. 

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Patch Available  (was: Open)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Attachment: MAPREDUCE-1017.patch

This patch introduces two new features/arguments to Sqoop:

* Data can be compressed via {{\-\-compress}} / {{\-z}}. This will enable gzipping of text inputs
* Users can specify the approximate maximum file size used in direct mode with {{\-\-direct-split-size}}, which takes an argument in bytes, of the approximate file size to generate. After writing a record which surpasses this boundary, a new file is opened. Because Sqoop uses buffered writers, this file size is approximate, though Sqoop guarantees that new files will only be opened on record boundaries.

The compression argument applies to non-direct-mode imports as well. Sqoop will now use a compression codec for writing text files when using a MapReduce-based import. Sqoop used to call {{SequenceFileOutputFormat.setCompressionEnabled(true)}}by default; this will now only be the case if the user explicitly requests compression.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Attachment: MAPREDUCE-1017.2.patch

New patch sync'd to trunk.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Patch Available  (was: Open)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Attachment: MAPREDUCE-1017.4.patch

New patch includes changes to documentation to reflect new arguments added by this feature.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771501#action_12771501 ] 

Hudson commented on MAPREDUCE-1017:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #127 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/127/])
    

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Status: Open  (was: Patch Available)

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Attachment: MAPREDUCE-1017.3.patch

New patch implementing CR suggestions.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768734#action_12768734 ] 

Hudson commented on MAPREDUCE-1017:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #93 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/93/])
    . Compression and output splitting for Sqoop. Contributed by Aaron Kimball.


> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.3.patch, MAPREDUCE-1017.4.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.