Posted to mapreduce-issues@hadoop.apache.org by "Ankit Kamboj (JIRA)" <ji...@apache.org> on 2012/06/20 00:44:42 UTC

[jira] [Created] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Ankit Kamboj created MAPREDUCE-4354:
---------------------------------------

             Summary: Performance improvement with compressor object reinit restriction
                 Key: MAPREDUCE-4354
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4354
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.205.0
            Reporter: Ankit Kamboj
             Fix For: 0.20.205.0


The HADOOP-5879 patch made GzipCodec pick up its settings from the conf instead of the defaults. As part of that change, every recycled compressor object is re-initialized.
In our performance tests, this re-initialization degraded LzoCodec performance by 15%, because re-initializing an Lzo compressor reallocates its buffers. LzoCodec already takes its initial settings from the config, so re-initializing it is unnecessary. This patch checks the codec class and calls reinit only if the codec class is Gzip. This improved LzoCodec performance by 15%.
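
For illustration, the restriction described above can be sketched as a simplified pool. This is only a sketch under stated assumptions: SketchCodecPool, SketchGzipCompressor, SketchLzoCompressor and the "sketch.gzip.level" key are hypothetical stand-ins invented for this example. They are not the Hadoop classes and not the attached patch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the Hadoop classes.
interface SketchCompressor {
    void reinit(Map<String, String> conf);
    int reinitCount();
}

class SketchGzipCompressor implements SketchCompressor {
    private int level = 6;   // assumed default until conf is applied
    private int reinits = 0;

    public void reinit(Map<String, String> conf) {
        // Gzip needs reinit so that a recycled object picks up conf settings.
        level = Integer.parseInt(conf.getOrDefault("sketch.gzip.level", "6"));
        reinits++;
    }

    public int reinitCount() { return reinits; }
    int level() { return level; }
}

class SketchLzoCompressor implements SketchCompressor {
    private byte[] buffer = new byte[64 * 1024];
    private int reinits = 0;

    public void reinit(Map<String, String> conf) {
        // Models the costly behaviour described above: the buffer is reallocated.
        buffer = new byte[buffer.length];
        reinits++;
    }

    public int reinitCount() { return reinits; }
}

class SketchCodecPool {
    private final Map<Class<?>, Deque<SketchCompressor>> idle = new HashMap<>();

    // Hand out a recycled compressor if one is idle, otherwise create a new one.
    SketchCompressor borrow(Class<? extends SketchCompressor> type, Map<String, String> conf) {
        Deque<SketchCompressor> queue = idle.get(type);
        SketchCompressor c = (queue == null) ? null : queue.poll();
        if (c == null) {
            return create(type);
        }
        // The restriction proposed in this issue: reinit only for Gzip.
        if (c instanceof SketchGzipCompressor) {
            c.reinit(conf);
        }
        return c;
    }

    void giveBack(SketchCompressor c) {
        idle.computeIfAbsent(c.getClass(), k -> new ArrayDeque<>()).push(c);
    }

    private static SketchCompressor create(Class<? extends SketchCompressor> type) {
        return type == SketchGzipCompressor.class
                ? new SketchGzipCompressor()
                : new SketchLzoCompressor();
    }
}

class ReinitRestrictionDemo {
    public static void main(String[] args) {
        SketchCodecPool pool = new SketchCodecPool();
        Map<String, String> conf = new HashMap<>();
        conf.put("sketch.gzip.level", "9");

        SketchCompressor lzo = pool.borrow(SketchLzoCompressor.class, conf);
        pool.giveBack(lzo);
        SketchCompressor recycled = pool.borrow(SketchLzoCompressor.class, conf);
        System.out.println("LZO reinit calls after reuse: " + recycled.reinitCount());
    }
}
```

In this sketch a recycled Lzo compressor is handed back without any reinit call, while a recycled Gzip compressor still picks up the level from the conf.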


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Kamboj updated MAPREDUCE-4354:
------------------------------------

    Component/s: performance
    

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405234#comment-13405234 ] 

Ankit Kamboj commented on MAPREDUCE-4354:
-----------------------------------------

This makes sense. The reinit for LZO should only reset the buffers instead of recreating them.
So I have removed the 'if' condition for GZIP in CodecPool and changed the reinit logic in LZOCompressor instead.
Please find the corresponding patch attached.
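
A minimal sketch of the difference between recreating and resetting buffers on reinit follows. The class SketchBufferedCompressor and all of its fields and methods are hypothetical names chosen for this example; they are not taken from the LZOCompressor source or from the attached patch.

```java
import java.nio.ByteBuffer;

// Hypothetical compressor skeleton; names are illustrative only and are
// not taken from the real LZOCompressor.
class SketchBufferedCompressor {
    private final int bufferSize;
    private ByteBuffer uncompressedBuf;
    private ByteBuffer compressedBuf;
    private long bytesRead;

    SketchBufferedCompressor(int bufferSize) {
        this.bufferSize = bufferSize;
        allocate();
    }

    private void allocate() {
        uncompressedBuf = ByteBuffer.allocateDirect(bufferSize);
        compressedBuf = ByteBuffer.allocateDirect(bufferSize);
    }

    // Costly variant: every reuse pays for two fresh direct buffers.
    void reinitByReallocating() {
        allocate();
        bytesRead = 0;
    }

    // Cheaper variant: keep the buffers and only rewind them.
    void reinitByResetting() {
        uncompressedBuf.clear();   // position = 0, limit = capacity, no allocation
        compressedBuf.clear();
        bytesRead = 0;
    }

    // Stage input bytes; stands in for the real compression path.
    void write(byte[] data) {
        uncompressedBuf.put(data);
        bytesRead += data.length;
    }

    ByteBuffer uncompressedBuffer() { return uncompressedBuf; }
    long bytesRead() { return bytesRead; }
}

class BufferResetDemo {
    public static void main(String[] args) {
        SketchBufferedCompressor c = new SketchBufferedCompressor(64 * 1024);
        c.write(new byte[] {1, 2, 3});
        ByteBuffer before = c.uncompressedBuffer();

        c.reinitByResetting();
        System.out.println("same buffer after reset: " + (before == c.uncompressedBuffer()));

        c.reinitByReallocating();
        System.out.println("same buffer after reallocation: " + (before == c.uncompressedBuffer()));
    }
}
```

Resetting leaves the compressor in a clean state for reuse without allocating anything, which is the behaviour the comment above asks for.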
                

        

[jira] [Updated] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Kamboj updated MAPREDUCE-4354:
------------------------------------

    Priority: Minor  (was: Major)
    

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405288#comment-13405288 ] 

Robert Joseph Evans commented on MAPREDUCE-4354:
------------------------------------------------

I am not an expert on compression, so I don't feel confident giving it a +1 right now, but the patch looks OK to me.  Do you have the results of running test-patch on this?  Do you have plans to port this to trunk/branch-2?
                

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405406#comment-13405406 ] 

Ankit Kamboj commented on MAPREDUCE-4354:
-----------------------------------------

Thanks! 
I tested this patch by running a Hive query on the same dataset with and without it. The results of the latest test follow:

Map execution times:
1. Without patch: 9.22 mins
2. With patch: 6.42 mins (about 30% less than without the patch, which took about 44% longer)

The overall (map + reduce) execution time:
1. Without patch: 14.61 mins
2. With patch: 11.85 mins (about 19% less than without the patch, which took about 23% longer)
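
The relative changes can be recomputed from the four times listed above. The sketch below is only an arithmetic check; the class and method names are made up for this example. It computes both the reduction relative to the run without the patch and how much longer the run without the patch takes relative to the patched run.

```java
import java.util.Locale;

class PatchTimingSketch {
    // Percentage by which `after` is smaller than `before`.
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    // Percentage by which `before` is larger than `after`.
    static double percentLonger(double before, double after) {
        return (before - after) / after * 100.0;
    }

    static String describe(String label, double before, double after) {
        return String.format(Locale.ROOT,
                "%s: %.1f%% less with patch; run without patch is %.1f%% longer",
                label, percentReduction(before, after), percentLonger(before, after));
    }

    public static void main(String[] args) {
        // Times in minutes, copied from the results above.
        System.out.println(describe("map", 9.22, 6.42));
        System.out.println(describe("map + reduce", 14.61, 11.85));
    }
}
```

With these inputs the map phase comes out at roughly 30.4% less time (the run without the patch being roughly 43.6% longer), and the overall job at roughly 18.9% less time (roughly 23.3% longer without the patch).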


 
                

        

[jira] [Updated] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Kamboj updated MAPREDUCE-4354:
------------------------------------

    Attachment: modify_lzo_codec_reinit

Modified the LZO compressor reinit for performance improvement.
                

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417337#comment-13417337 ] 

Robert Joseph Evans commented on MAPREDUCE-4354:
------------------------------------------------

The test results look great to me, but my comment about contributing this to trunk is off base.  My ignorance is showing :).  The LZO compression libraries that you modified are not hosted here.

You need to look at 

http://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/?redir=1

or

https://github.com/omalley/hadoop-gpl-compression

And email the dev list there.  Owen O'Malley is probably the right person to talk to there about getting this patch in.  Once it is in, it should work on both trunk and 0.20.205.
                

        

[jira] [Updated] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankit Kamboj updated MAPREDUCE-4354:
------------------------------------

    Attachment: codec_reinit_diff
    

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417305#comment-13417305 ] 

Ankit Kamboj commented on MAPREDUCE-4354:
-----------------------------------------

Hi Robert,

Do the test results look good?
On your question "Do you have plans to port this to trunk/branch-2?", what would be the process to do that?

Thanks!
                

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423894#comment-13423894 ] 

Robert Joseph Evans commented on MAPREDUCE-4354:
------------------------------------------------

I am not. What you can do is create your own GitHub account, clone Owen's GPL compression library, push your changes, and file a pull request.  I don't know what else to do. I'll try to reach out to him too.
                

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397546#comment-13397546 ] 

Robert Joseph Evans commented on MAPREDUCE-4354:
------------------------------------------------

Wouldn't it be cleaner to stub out reinit for LZO if reinit is not needed for it?
                

        

[jira] [Commented] (MAPREDUCE-4354) Performance improvement with compressor object reinit restriction

Posted by "Ankit Kamboj (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423460#comment-13423460 ] 

Ankit Kamboj commented on MAPREDUCE-4354:
-----------------------------------------

Thanks Robert!

I tried contacting Owen but haven't heard back. Are you aware of a wider email list that I can send the mail to?
                