Posted to user@hadoop.apache.org by samir das mohapatra <sa...@gmail.com> on 2013/06/12 06:07:09 UTC
Now give .gz file as input to the MAP
Hi All,
Has anyone worked on how to pass a .gz file as file input to a
MapReduce job?
Regards,
samir.
Re: Now give .gz file as input to the MAP
Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Yeah, I too found that quite slow and memory-hungry!
Thanks,
Rahul-da
On Wed, Jun 12, 2013 at 11:13 PM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:
> Rahul-da
>
> I found bz2 pretty slow (although splittable) so I switched to snappy
> (only sequence files are splittable but compress-decompress is fast)
>
> Thanks
> Sanjay
Re: Now give .gz file as input to the MAP
Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Rahul-da
I found bz2 pretty slow (although splittable), so I switched to Snappy (with Snappy, only sequence files are splittable, but compression and decompression are fast).
Thanks
Sanjay
From: Rahul Bhattacharjee <ra...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:53 PM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Now give .gz file as input to the MAP
Nothing special is required to process .gz files using MR. However, as Sanjay mentioned, verify the codecs configured in core-site.xml. Another thing to note is that these files are not splittable.
You might want to use bz2; those files are splittable.
Thanks,
Rahul
On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <Sa...@wizecommerce.com> wrote:
hadoopConf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");
hadoopConf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");
No special settings are required for reading Gzip input other than the two above.
If you want to output Gzip:
hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
Make sure the Gzip codec is defined in core-site.xml:
<!-- core-site.xml -->
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
I have a question:
Why are you using Gzip as input to the map phase? These files are not splittable. That only helps if you have to read multi-line records (such as the lines between a BEGIN and an END block in a log file) and send each one to the mapper as a single record.
Also, among non-splittable codecs, Snappy is the better choice.
Good Luck
sanjay
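The multi-line case mentioned above, where the lines between a BEGIN and an END marker reach the mapper as one record, could look roughly like the sketch below. It is plain Java rather than a Hadoop RecordReader, and the class name BlockRecords, the literal marker strings "BEGIN" and "END", and the choice to ignore lines outside a block are all assumptions made for illustration. The grouping is simple precisely because a non-splittable file is read start to finish by one reader, so no block can straddle a split boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BlockRecords {
    // Groups the lines between a BEGIN marker and the next END marker into one record.
    // Lines outside any block are ignored; an unterminated trailing block is dropped.
    // (Marker strings are hypothetical; adapt them to the actual log format.)
    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        List<String> current = null; // non-null only while inside a block
        for (String line : lines) {
            if (line.equals("BEGIN")) {
                current = new ArrayList<>();
            } else if (line.equals("END")) {
                if (current != null) {
                    records.add(String.join("\n", current));
                    current = null;
                }
            } else if (current != null) {
                current.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "BEGIN", "a", "b", "END", "noise", "BEGIN", "c", "END");
        // Each printed record holds the joined lines of one BEGIN/END block.
        for (String record : group(log)) {
            System.out.println("record: " + record.replace("\n", " | "));
        }
    }
}
```

In a real job the same loop would sit inside a custom record reader or the mapper itself; that part is not shown here.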
From: samir das mohapatra <sa...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:07 PM
To: "cdh-user@cloudera.com" <cd...@cloudera.com>, "user@hadoop.apache.org" <us...@hadoop.apache.org>, "user-help@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Now give .gz file as input to the MAP
Hi All,
Has anyone worked on how to pass a .gz file as file input to a MapReduce job?
Regards,
samir.
Re: Now give .gz file as input to the MAP
Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Nothing special is required to process .gz files using MR. However, as
Sanjay mentioned, verify the codecs configured in core-site.xml. Another
thing to note is that these files are not splittable.
You might want to use bz2; those files are splittable.
Thanks,
Rahul
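The point that .gz files are not splittable can be seen with the JDK alone. A gzip stream carries a single header at its start, and the deflate data that follows refers back to earlier bytes, so a reader cannot begin decoding at an arbitrary offset the way a split would need to. The sketch below is only an illustration: the class name GzipSplitDemo and its helper methods are hypothetical, and it uses java.util.zip rather than Hadoop's codec classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSplitDemo {
    // Compresses text into a single gzip stream, standing in for one .gz input file.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns true if the bytes from `offset` to the end decode as a gzip stream.
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Drain the stream; a missing header or corrupt data throws an IOException.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip("line one\nline two\nline three\n");
        System.out.println("decodes from the start: " + canDecodeFrom(gz, 0));
        System.out.println("decodes from the middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

This is why a single .gz file is handled by one mapper from beginning to end, however large the file is.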
On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:
> hadoopConf.set("mapreduce.job.inputformat.class",
> "com.wizecommerce.utils.mapred.TextInputFormat");
>
> hadoopConf.set("mapreduce.job.outputformat.class",
> "com.wizecommerce.utils.mapred.TextOutputFormat");
> No special settings required for reading Gzip except these above
>
> I u want to output Gzip
>
> hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
>
> hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
> "org.apache.hadoop.io.compress.GzipCodec");
>
> Make sure Gzip codec is defined in core-site.xml
> <!-- core-site.xml -->
> <property>
> <name>io.compression.codecs</name>
> <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
> </property>
>
> I have a question:
>
> Why are you using GZIP as input to Map? These files are not splittable.
> It would make sense only if you have to read multi-line records (like the
> lines between a BEGIN and END block in a log file) and send each one as a
> single record to the mapper.
>
> Also, for a non-splittable codec, Snappy is a better choice.
>
> Good Luck
>
>
> sanjay
>
> From: samir das mohapatra <sa...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Tuesday, June 11, 2013 9:07 PM
> To: "cdh-user@cloudera.com" <cd...@cloudera.com>, "
> user@hadoop.apache.org" <us...@hadoop.apache.org>, "
> user-help@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Now give .gz file as input to the MAP
>
> Hi All,
> Has anyone worked on how to pass a .gz file as input to a
> MapReduce job?
>
> Regards,
> samir.
>
Re: Now give .gz file as input to the MAP
Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
hadoopConf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");
hadoopConf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");
No special settings are required for reading Gzip except the ones above.
If you want to output Gzip:
hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
Make sure Gzip codec is defined in core-site.xml
<!-- core-site.xml -->
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
I have a question:
Why are you using GZIP as input to Map? These files are not splittable. It would make sense only if you have to read multi-line records (like the lines between a BEGIN and END block in a log file) and send each one as a single record to the mapper.
Also, for a non-splittable codec, Snappy is a better choice.
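The multi-line case mentioned above (lines between BEGIN and END sent as one record) can be sketched outside Hadoop as follows. This is only an illustration using the JDK's GZIPInputStream, not a Hadoop RecordReader; the class name, the method names, and the exact BEGIN/END marker lines are assumptions. It reads the gzipped data sequentially from the start, which is also what a single mapper has to do with a .gz input, and joins the lines of each block into one record.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MultiLineGzipReader {

    /** Helper for the demo: gzip-compresses a string into a byte array. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /**
     * Reads gzipped text and returns one record per BEGIN ... END block.
     * Lines outside a block are ignored; the marker lines are not included.
     */
    public static List<String> readBlocks(byte[] gzipped) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.UTF_8))) {
            StringBuilder current = null; // non-null only while inside a block
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.equals("BEGIN")) {
                    current = new StringBuilder();
                } else if (line.equals("END")) {
                    if (current != null) {
                        records.add(current.toString());
                        current = null;
                    }
                } else if (current != null) {
                    if (current.length() > 0) {
                        current.append('\n');
                    }
                    current.append(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = gzip("noise\nBEGIN\na\nb\nEND\nBEGIN\nc\nEND\n");
        for (String record : readBlocks(data)) {
            System.out.println("record: " + record.replace("\n", " | "));
        }
    }
}
```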
Good Luck
sanjay
From: samir das mohapatra <sa...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:07 PM
To: "cdh-user@cloudera.com" <cd...@cloudera.com>, "user@hadoop.apache.org" <us...@hadoop.apache.org>, "user-help@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Now give .gz file as input to the MAP
Hi All,
Has anyone worked on how to pass a .gz file as input to a MapReduce job?
Regards,
samir.