Posted to user@sqoop.apache.org by Ken Krugler <kk...@transpac.com> on 2011/09/05 00:49:51 UTC

Controlling compression during import

Hi there,

The current documentation says:
> By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to both SequenceFiles or text files.
> 
But I think this is a bit misleading.

Currently, if output compression is enabled in a cluster, then the Sqooped data is always compressed, regardless of the setting of this flag.
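
For reference, the behavior I'm seeing comes from the kind of cluster-wide default shown below (a minimal sketch, assuming a 0.20-era mapred-site.xml; the property names and codec on a given cluster may differ):

    <!-- mapred-site.xml: compress all job output by default -->
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>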

It seems better to make compression actually controllable via --compress, which means changing ImportJobBase.configureOutputFormat():

    if (options.shouldUseCompression()) {
      FileOutputFormat.setCompressOutput(job, true);
      FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
      SequenceFileOutputFormat.setOutputCompressionType(job,
          CompressionType.BLOCK);
    } else {
      // New: when --compress isn't set, explicitly disable output
      // compression instead of inheriting the cluster-wide default.
      FileOutputFormat.setCompressOutput(job, false);
    }
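
For example (connection string and table name are placeholders), with that change the first command below would produce uncompressed output even on a cluster that defaults compression on, while the second still gzips it:

    sqoop import --connect jdbc:mysql://db.example.com/mydb --table orders
    sqoop import --connect jdbc:mysql://db.example.com/mydb --table orders --compress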

Thoughts?

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: Controlling compression during import

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 6, 2011, at 6:58am, Kate Ting wrote:

> Hi Ken, you make some good points, to which I've added comments individually.
> 
> re: the degree of parallelism during the next step of processing is
> constrained by the number of mappers used during sqooping: does
> https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
> might want to add your comments there.

Thanks for the ref, and yes that would help.

> re: winding up with unsplittable files and heavily skewed sizes: you
> can file separate JIRAs for those if desired.

That's not really a Sqoop issue; it's just how Hadoop works.

> re: partitioning isn't great: for some databases such as Oracle, the
> problem of heavily skewed sizes can be overcome using row-ids; you can
> file a JIRA for that if you feel it's needed.

Again, not really a Sqoop issue. Things are fine with OraOop.

When we fall back to regular Sqoop, we don't have a good column to use for partitioning, so the results wind up being heavily skewed. But I don't think there's anything Sqoop could do to easily solve that problem.
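
To be concrete, if the table had an evenly distributed numeric column we could hand it to --split-by and get reasonably balanced splits (the connection string, table, and column below are made up):

    sqoop import --connect jdbc:mysql://db.example.com/mydb --table orders \
      --split-by order_id --num-mappers 8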

Regards,

-- Ken


> On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
> <kk...@transpac.com> wrote:
>> 
>> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>> 
>>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kk...@transpac.com> wrote:
>>>> Hi there,
>>>> The current documentation says:
>>>> 
>>>> By default, data is not compressed. You can compress your data by using the
>>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>>> Hadoop compression codec using the --compression-codec argument. This
>>>> applies to both SequenceFiles or text files.
>>>> 
>>>> But I think this is a bit misleading.
>>>> Currently if output compression is enabled in a cluster, then the Sqooped
>>>> data is always compressed, regardless of the setting of this flag.
>>>> It seems better to actually make compression controllable via --compress,
>>>> which means changing ImportJobBase.configureOutputFormat()
>>>>     if (options.shouldUseCompression()) {
>>>>       FileOutputFormat.setCompressOutput(job, true);
>>>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>>>           CompressionType.BLOCK);
>>>>     }
>>>>    // new stuff
>>>>     else {
>>>>       FileOutputFormat.setCompressOutput(job, false);
>>>>     }
>>>> Thoughts?
>>> 
>>> This is a good point Ken. However, IMO it is better left as is since
>>> there may be a wider cluster management policy in effect that requires
>>> compression for all output files. One way to look at it is that for
>>> normal use, there is a predefined compression scheme configured
>>> cluster wide, and occasionally when required, Sqoop users can use a
>>> different scheme where necessary.
>> 
>> The problem is that when you use text files as Sqoop output, these get compressed at the file level by (typically) deflate, gzip or lzo.
>> 
>> So you wind up with unsplittable files, which means that the degree of parallelism during the next step of processing is constrained by the number of mappers used during sqooping. But you typically set the number of mappers based on DB load & size of the data set.
>> 
>> And if partitioning isn't great, then you also wind up with heavily skewed sizes for these unsplittable files, which makes things even worse.
>> 
>> The current work-around is to use binary or Avro output instead of text, but that's an odd requirement to be able to avoid the above problem.
>> 
>> If the argument is to avoid implicitly changing the cluster's default compression policy, then I'd suggest supporting a -nocompression flag.
>> 
>> Regards,
>> 
>> -- Ken
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>> 
>> 
>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: Controlling compression during import

Posted by Kate Ting <ka...@cloudera.com>.
Hi Ken, you make some good points, to which I've added comments individually.

re: the degree of parallelism during the next step of processing is
constrained by the number of mappers used during sqooping: does
https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
might want to add your comments there.

re: winding up with unsplittable files and heavily skewed sizes: you
can file separate JIRAs for those if desired.

re: partitioning isn't great: for some databases such as Oracle, the
problem of heavily skewed sizes can be overcome using row-ids; you can
file a JIRA for that if you feel it's needed.

Regards, Kate

On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
<kk...@transpac.com> wrote:
>
> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>
>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kk...@transpac.com> wrote:
>>> Hi there,
>>> The current documentation says:
>>>
>>> By default, data is not compressed. You can compress your data by using the
>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>> Hadoop compression codec using the --compression-codec argument. This
>>> applies to both SequenceFiles or text files.
>>>
>>> But I think this is a bit misleading.
>>> Currently if output compression is enabled in a cluster, then the Sqooped
>>> data is always compressed, regardless of the setting of this flag.
>>> It seems better to actually make compression controllable via --compress,
>>> which means changing ImportJobBase.configureOutputFormat()
>>>     if (options.shouldUseCompression()) {
>>>       FileOutputFormat.setCompressOutput(job, true);
>>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>>           CompressionType.BLOCK);
>>>     }
>>>    // new stuff
>>>     else {
>>>       FileOutputFormat.setCompressOutput(job, false);
>>>     }
>>> Thoughts?
>>
>> This is a good point Ken. However, IMO it is better left as is since
>> there may be a wider cluster management policy in effect that requires
>> compression for all output files. One way to look at it is that for
>> normal use, there is a predefined compression scheme configured
>> cluster wide, and occasionally when required, Sqoop users can use a
>> different scheme where necessary.
>
> The problem is that when you use text files as Sqoop output, these get compressed at the file level by (typically) deflate, gzip or lzo.
>
> So you wind up with unsplittable files, which means that the degree of parallelism during the next step of processing is constrained by the number of mappers used during sqooping. But you typically set the number of mappers based on DB load & size of the data set.
>
> And if partitioning isn't great, then you also wind up with heavily skewed sizes for these unsplittable files, which makes things even worse.
>
> The current work-around is to use binary or Avro output instead of text, but that's an odd requirement to be able to avoid the above problem.
>
> If the argument is to avoid implicitly changing the cluster's default compression policy, then I'd suggest supporting a -nocompression flag.
>
> Regards,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>

Re: Controlling compression during import

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:

> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kk...@transpac.com> wrote:
>> Hi there,
>> The current documentation says:
>> 
>> By default, data is not compressed. You can compress your data by using the
>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>> Hadoop compression codec using the --compression-codec argument. This
>> applies to both SequenceFiles or text files.
>> 
>> But I think this is a bit misleading.
>> Currently if output compression is enabled in a cluster, then the Sqooped
>> data is always compressed, regardless of the setting of this flag.
>> It seems better to actually make compression controllable via --compress,
>> which means changing ImportJobBase.configureOutputFormat()
>>     if (options.shouldUseCompression()) {
>>       FileOutputFormat.setCompressOutput(job, true);
>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>           CompressionType.BLOCK);
>>     }
>>    // new stuff
>>     else {
>>       FileOutputFormat.setCompressOutput(job, false);
>>     }
>> Thoughts?
> 
> This is a good point Ken. However, IMO it is better left as is since
> there may be a wider cluster management policy in effect that requires
> compression for all output files. One way to look at it is that for
> normal use, there is a predefined compression scheme configured
> cluster wide, and occasionally when required, Sqoop users can use a
> different scheme where necessary.

The problem is that when you use text files as Sqoop output, they get compressed at the file level, typically with deflate, gzip, or LZO.

So you wind up with unsplittable files, which means that the degree of parallelism during the next step of processing is constrained by the number of mappers used during sqooping. But you typically set the number of mappers based on DB load & size of the data set.

And if partitioning isn't great, then you also wind up with heavily skewed sizes for these unsplittable files, which makes things even worse.

The current work-around is to use binary or Avro output instead of text, but that's an odd requirement just to avoid the above problem.

If the argument is to avoid implicitly changing the cluster's default compression policy, then I'd suggest supporting a -nocompression flag.
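
Concretely, something like this in configureOutputFormat(), where shouldDisableCompression() is hypothetical and would need to be wired through SqoopOptions and the argument parser:

    if (options.shouldUseCompression()) {
      FileOutputFormat.setCompressOutput(job, true);
      FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
      SequenceFileOutputFormat.setOutputCompressionType(job,
          CompressionType.BLOCK);
    } else if (options.shouldDisableCompression()) {
      // Only override the cluster-wide default when the user explicitly
      // asks for uncompressed output via -nocompression.
      FileOutputFormat.setCompressOutput(job, false);
    }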

Regards,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: Controlling compression during import

Posted by Arvind Prabhakar <ar...@apache.org>.
On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <kk...@transpac.com> wrote:
> Hi there,
> The current documentation says:
>
> By default, data is not compressed. You can compress your data by using the
> deflate (gzip) algorithm with the -z or --compress argument, or specify any
> Hadoop compression codec using the --compression-codec argument. This
> applies to both SequenceFiles or text files.
>
> But I think this is a bit misleading.
> Currently if output compression is enabled in a cluster, then the Sqooped
> data is always compressed, regardless of the setting of this flag.
> It seems better to actually make compression controllable via --compress,
> which means changing ImportJobBase.configureOutputFormat()
>     if (options.shouldUseCompression()) {
>       FileOutputFormat.setCompressOutput(job, true);
>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>       SequenceFileOutputFormat.setOutputCompressionType(job,
>           CompressionType.BLOCK);
>     }
>    // new stuff
>     else {
>       FileOutputFormat.setCompressOutput(job, false);
>     }
> Thoughts?

This is a good point, Ken. However, IMO it is better left as is, since
there may be a wider cluster management policy in effect that requires
compression for all output files. One way to look at it is that for
normal use there is a predefined compression scheme configured
cluster-wide, and when required, Sqoop users can use a different
scheme.
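
For instance, a user who needs uncompressed output for one particular import could, in principle, override the cluster default for just that job via Hadoop's generic options (untested sketch; connection string and table are placeholders, and this assumes the cluster setting is not marked final):

    sqoop import -D mapred.output.compress=false \
      --connect jdbc:mysql://db.example.com/mydb --table orders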

Thanks,
Arvind


> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>