Posted to common-issues@hadoop.apache.org by "Ravi Prakash (JIRA)" <ji...@apache.org> on 2016/05/04 01:50:13 UTC
[jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.

    [ https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269857#comment-15269857 ] 

Ravi Prakash commented on HADOOP-8065:
--------------------------------------

Thanks for the patch [~snayakm]! Here are some of my thoughts:

# What users seem to want is to be able to compress data *during transit*. {color:red}*This patch does not enable compression of data during transit.*{color} Distcp is simply an MR job whose maps read from a "source". If the source does not support compressing the data before putting it on the network, I don't see how we could achieve what these users want.
# *We are simply enabling users to avoid a post-processing step to compress the data they have already transferred*. This too is a noble goal if it makes the lives of users easier IMHO. It also reduces the amount of space needed on the target filesystem. We should rewrite the JIRA summary to be more explicit if that is the stated goal.

Reviewing the patch:
# Do you really need the changes in {{CopyMapper}}?
# Nit: {{getCompressionCodcec}} is misspelt
# Instead of {code}      e.printStackTrace();
      LOG.error("Compression class " + compressionCodecClass
          + " not found in classpath");{code} you can simply pass {{e}} as a second argument to the LOG.error method.
# With this patch, we'll end up creating an instance of a Codec for every file. Do you think we could utilize something like {{org.apache.hadoop.io.compress.CodecPool}}?
# Perhaps we can add an option {{-compressOutput}} which defaults to some codec?
# Although it's conceivable that we may want to decompress before writing to the target filesystem, we can punt that to another JIRA.
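
To illustrate the logging point in item 3 above, here is a minimal sketch of handing the exception to the logger instead of calling {{e.printStackTrace()}}. It is not code from the patch: the class name {{CodecLookup}} and the method {{findCodecClass}} are hypothetical, and {{java.util.logging}} stands in for the logger used in distcp so that the snippet runs on a bare JDK. With Commons Logging, the equivalent call would be {{LOG.error(message, e)}}.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the codec lookup in the patch; not Hadoop code.
public class CodecLookup {
  private static final Logger LOG =
      Logger.getLogger(CodecLookup.class.getName());

  /** Returns the named class, or null if it is not on the classpath. */
  static Class<?> findCodecClass(String compressionCodecClass) {
    try {
      return Class.forName(compressionCodecClass);
    } catch (ClassNotFoundException e) {
      // Passing e to the logger records the message and the stack trace
      // together, so a separate e.printStackTrace() is not needed.
      LOG.log(Level.SEVERE, "Compression class " + compressionCodecClass
          + " not found in classpath", e);
      return null;
    }
  }

  public static void main(String[] args) {
    // A JDK class is used here only so the lookup succeeds without Hadoop.
    System.out.println(findCodecClass("java.util.zip.GZIPOutputStream"));
    // A made-up name to exercise the error path.
    System.out.println(findCodecClass("no.such.Codec"));
  }
}
```

Keeping the stack trace in the log, rather than on stderr, means it ends up alongside the rest of the task's log output.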

Thanks for your efforts! :-)

> distcp should have an option to compress data while copying.
> ------------------------------------------------------------
>
>                 Key: HADOOP-8065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8065
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.20.2
>            Reporter: Suresh Antony
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>             Fix For: 0.20.2
>
>         Attachments: HADOOP-8065-trunk_2015-11-03.patch, HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch, patch.distcp.2012-02-10
>
>
> We would like to compress the data while transferring it from our source system to our target system. One way to do this is to write a map/reduce job that compresses the data before or after it is transferred. This looks inefficient.
> Since distcp is already reading and writing the data, it would be better if it could do the compression at the same time.
> The flip side of this is that the distcp -update option cannot check file size before copying data. It can only check for the existence of the file.
> So I propose that if the -compress option is given, file size is not checked.
> Also, when we copy a file, the appropriate extension needs to be added to the file name depending on the compression type.


