Posted to common-user@hadoop.apache.org by "hao.wang" <ha...@ipinyou.com> on 2012/01/09 13:20:44 UTC

how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hi, all
    Our Hadoop cluster has 22 nodes: one namenode, one jobtracker, and 20 datanodes.
    Each node has 2 * 12 cores and 32 GB of RAM.
    Could anyone tell me how to configure the following parameters:
    mapred.tasktracker.map.tasks.maximum
    mapred.tasktracker.reduce.tasks.maximum

regards!
2012-01-09 



hao.wang 

Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by Harsh J <ha...@cloudera.com>.
Hello again,

Try a 4:3 ratio between maps and reduces, against the total # of available CPUs per node (minus one or two, for the DN and HBase if you run those). Then tweak it as you go (more map-only loads or more map-reduce loads depends on your usage, and you can adjust the ratio accordingly over time -- changing those props does not need a JobTracker restart, just a TaskTracker restart).
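For a node like the one described (2 * 12 = 24 cores), the arithmetic can be sketched as follows; the helper function and the reserved-core count are illustrative, not part of Hadoop:

```python
def slot_counts(total_cores, reserved=2, map_ratio=4, reduce_ratio=3):
    """Split usable cores into map/reduce task slots at the given ratio.

    `reserved` leaves cores free for the DataNode/TaskTracker daemons
    (and HBase, if running), per the advice above.
    """
    usable = total_cores - reserved
    maps = round(usable * map_ratio / (map_ratio + reduce_ratio))
    reduces = usable - maps
    return maps, reduces

# A 24-core node with 2 cores reserved at a 4:3 ratio:
print(slot_counts(24))  # -> (13, 9)
```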



Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Hao,

Ideally you would want to leave out a core each for the TaskTracker and
DataNode processes on each node. The rest can be used for map and
reduce tasks.

Thanks,
Prashant


Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by Harsh J <ha...@cloudera.com>.
Yes, divide the number of cores between map and reduce slots. Depending on your workload, start with a 4:3 ratio and work your way to better tuning eventually (if you have more map-only jobs, adjust ratio accordingly, etc.).

Changing slot params requires a TaskTracker restart alone, not a JobTracker restart, so you can do it without much trouble on a live cluster too.
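In mapred-site.xml on each TaskTracker node, that split would look something like the sketch below. The values shown assume a 24-core node with two cores reserved for the DataNode/TaskTracker daemons; adjust them to your own core count and ratio:

```xml
<!-- mapred-site.xml (per TaskTracker node); example values only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>13</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>9</value>
</property>
```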


Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by "hao.wang" <ha...@ipinyou.com>.
Hi,
    Thanks for your help; your suggestion is very useful.
    I have another question: should the sum of map and reduce slots equal the total number of cores?

regards!


2012-01-10 



hao.wang 




Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by Harsh J <ha...@cloudera.com>.
Hello Hao,

I'm sorry if I confused you. By CPUs I meant the CPUs visible to your OS (/proc/cpuinfo), so yes, the total number of cores.
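That count can also be read programmatically; a quick check using only the standard library:

```python
import os

# Logical CPUs visible to the OS -- the same count as the number of
# "processor" entries in /proc/cpuinfo on Linux.
print(os.cpu_count())
```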



Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by "hao.wang" <ha...@ipinyou.com>.
Hi, 

Thanks for your reply!
I may not be able to apply your suggestion to our Hadoop cluster directly,
because each server in our cluster contains just 2 CPUs. 
     So I think maybe you mean the number of cores, not the number of CPUs, in each server? 
I am looking forward to your reply.

regards!


2012-01-10 



hao.wang 




Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by "hao.wang" <ha...@ipinyou.com>.
Hi,
    Thanks for your reply!
    I had already read those pages. Could you give me some more specific suggestions on how to choose the values of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum for our cluster configuration, if possible?

regards!


2012-01-10 



hao.wang 



From: Harsh J 
Sent: 2012-01-09 23:19:21 
To: common-user 
Cc: 
Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum 
 
Hi,
Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.

Re: has bzip2 compression been deprecated?

Posted by Bejoy Ks <be...@gmail.com>.
Hi Tony
      Please find responses inline

So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW
FORMAT and other parameters you mention are telling Hive what to expect
when it reads the data I want to analyse, despite not checking the data to
see if it meets these criteria?

[Bejoy] Yes, no data format validation is performed on CREATE TABLE. You
only get to know of data issues when you QUERY the table.

Do these guidelines still apply if the table is not EXTERNAL?

[Bejoy] Yes. EXTERNAL tables are not far different from Hive-managed tables
(normal tables). The basic difference is that with CREATE TABLE the
data dir is created under /user/hive/warehouse (in the default conf), whereas
with EXTERNAL TABLES you can point to any dir in HDFS as the data dir. The
main difference to keep in mind is that if you DROP an EXTERNAL TABLE the data
dir in hdfs is not deleted, while for NORMAL TABLES it is deleted
(you completely lose the data there).
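A minimal sketch of the distinction (the table names, columns, and path are hypothetical):

```sql
-- Managed table: data dir created under the warehouse dir;
-- DROP TABLE deletes the data as well.
CREATE TABLE logs_managed (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: data stays at the given LOCATION;
-- DROP TABLE removes only the metadata.
CREATE EXTERNAL TABLE logs_external (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/logs';
```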

Regards
Bejoy.K.S


Re: has bzip2 compression been deprecated?

Posted by Harsh J <ha...@cloudera.com>.
Tony,

Sorry for being ambiguous, I was too lazy to search at the time. This has been the case since release 0.18.0. See https://issues.apache.org/jira/browse/HADOOP-2095 for more information.

On 10-Jan-2012, at 4:18 PM, Tony Burton wrote:

> Thanks all for advice - one more question on re-reading Harsh's helpful reply. " Intermediate (M-to-R) files use a custom IFile format these days". How recently is "these days", and can this addition be pinned down to any one version of Hadoop?
> 
> Tony
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com] 
> Sent: 09 January 2012 16:50
> To: common-user@hadoop.apache.org
> Subject: Re: has bzip2 compression been deprecated?
> 
> Tony,
> 
> * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it out (instead of a plain "fs -cat"). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format.
> * Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose.
> * Hive can use SequenceFiles very well. There is also documented info on this in the Hive's wiki pages (Check the DDL pages, IIRC).
> 
> On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> 
>> Thanks for the quick reply and the clarification about the documentation.
>> 
>> Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job, but that they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use of a job's output is analysis via Hive - can Hive create tables from sequence files? 
>> 
>> Tony
>> 
>> 
>> 
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com] 
>> Sent: 09 January 2012 15:34
>> To: common-user@hadoop.apache.org
>> Subject: Re: has bzip2 compression been deprecated?
>> 
>> Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+).
>> 
>> To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.
>> 
>> The best way would be to use either:
>> 
>> (a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
>> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added.
>> 
>> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
>> 
>>> Hi,
>>> 

RE: has bzip2 compression been deprecated?

Posted by Tony Burton <TB...@SportingIndex.com>.
Thanks all for the advice. One more question on re-reading Harsh's helpful reply: "Intermediate (M-to-R) files use a custom IFile format these days". How recent is "these days", and can this addition be pinned down to any one version of Hadoop?

Tony





-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: 09 January 2012 16:50
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Tony,

* Yeah, SequenceFiles aren't human-readable, but "fs -text" can read them out (instead of a plain "fs -cat"). But if you are going to export your files into a system you do not have much control over, it is probably best to have the resultant files not be in SequenceFile/Avro-DataFile format.
* Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose.
* Hive can use SequenceFiles very well. There is also documented info on this in Hive's wiki pages (check the DDL pages, IIRC).
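A toy illustration of why container formats like SequenceFile need "fs -text" rather than "fs -cat": records are length-prefixed binary, not newline-delimited text. This sketch is purely hypothetical and far simpler than the real SequenceFile layout (no header, sync markers, or compression support):

```python
import io
import struct

def write_records(buf, records):
    # Each record: 4-byte key length, 4-byte value length, then raw bytes.
    for key, value in records:
        buf.write(struct.pack(">II", len(key), len(value)))
        buf.write(key)
        buf.write(value)

def read_records(buf):
    # Stream (key, value) pairs back until EOF.
    while True:
        header = buf.read(8)
        if len(header) < 8:
            return
        klen, vlen = struct.unpack(">II", header)
        yield buf.read(klen), buf.read(vlen)

buf = io.BytesIO()
write_records(buf, [(b"k1", b"hello"), (b"k2", b"world")])
buf.seek(0)
print(list(read_records(buf)))  # [(b'k1', b'hello'), (b'k2', b'world')]
```

A plain byte dump of such a file shows length headers interleaved with the payload, which is why a format-aware reader (like "fs -text" for SequenceFiles) is needed.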

On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:

> Thanks for the quick reply and the clarification about the documentation.
> 
> Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use of a job's output is analysis via Hive - can Hive create tables from sequence files? 
> 
> Tony
> 
> 
> 
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com] 
> Sent: 09 January 2012 15:34
> To: common-user@hadoop.apache.org
> Subject: Re: has bzip2 compression been deprecated?
> 
> Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+).
> 
> To answer your question though, bzip2 was removed from that document because it isn't a native library (it's pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.
> 
> The best way would be to use either:
> 
> (a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added.
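Harsh's speed caveat is easy to reproduce outside Hadoop, since Python's standard library wraps the same zlib/bzip2 codecs. This is only a sketch of the trade-off (absolute numbers depend on the data and machine), not Hadoop code:

```python
import bz2
import gzip
import time

# Log-like, highly repetitive sample data.
data = b"2012-01-09 INFO mapred.JobClient:  map 100% reduce 100%\n" * 20000

def measure(name, compress):
    """Compress `data` once and report output size and wall-clock time."""
    start = time.time()
    out = compress(data)
    return name, len(out), time.time() - start

results = [
    measure("gzip", lambda d: gzip.compress(d, compresslevel=6)),
    measure("bzip2", lambda d: bz2.compress(d, compresslevel=9)),
]
for name, size, secs in results:
    print(f"{name:5s}  {size:8d} bytes  {secs:.3f}s")

# bzip2 usually wins on compression ratio and loses on speed, which is the
# compromise the thread is weighing against its splittability.
```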
> 
> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> 
>> Hi,
>> 
>> I'm trying to work out which compression algorithm I should be using in my MapReduce jobs.  It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.
>> 
>> However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
>> 
>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
>> 
>> Thanks,
>> 
>> Tony

www.sportingindex.com
Inbound Email has been scanned for viruses and SPAM 
**********************************************************************
This email and any attachments are confidential, protected by copyright and may be legally privileged.  If you are not the intended recipient, then the dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system.  Neither Sporting Index nor the sender accepts responsibility for any virus, or any other defect which might affect any computer or IT system into which the email is received and/or opened.  It is the responsibility of the recipient to scan the email and no responsibility is accepted for any loss or damage arising in any way from receipt or use of this email.  Sporting Index Ltd is a company registered in England and Wales with company number 2636842, whose registered office is at Brookfield House, Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and regulated by the UK Financial Services Authority (reg. no. 150404). Any financial promotion contained herein has been issued 
and approved by Sporting Index Ltd.

Outbound email has been scanned for viruses and SPAM

Re: has bzip2 compression been deprecated?

Posted by Joey Echeverria <jo...@cloudera.com>.
Yes. Hive doesn't format data when you load it. The only exception is if you do an INSERT OVERWRITE ... .

-Joey

On Jan 10, 2012, at 6:08, Tony Burton <TB...@SportingIndex.com> wrote:

> Thanks for this Bejoy, very helpful. 
> 
> So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW FORMAT and other parameters you mention are telling Hive what to expect when it reads the data I want to analyse, despite not checking the data to see if it meets these criteria?
> 
> Do these guidelines still apply if the table is not EXTERNAL?
> 
> Tony
> 

RE: has bzip2 compression been deprecated?

Posted by Tony Burton <TB...@SportingIndex.com>.
Thanks for this Bejoy, very helpful. 

So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW FORMAT and other parameters you mention are telling Hive what to expect when it reads the data I want to analyse, despite not checking the data to see if it meets these criteria?

Do these guidelines still apply if the table is not EXTERNAL?

Tony

 

-----Original Message-----
From: Bejoy Ks [mailto:bejoy.hadoop@gmail.com] 
Sent: 09 January 2012 19:00
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
       As I understand your requirement, your MapReduce job produces a
Sequence File as output and you need to use this file as input to a Hive
table.
        When you CREATE an EXTERNAL TABLE in Hive you specify a location
where your data is stored and also the format of that data (like
the field delimiter, row delimiter, file type etc. of your data). You are
not actually loading data anywhere when you create a Hive external
table (issuing DDL), just specifying where the data lies in the file system;
in fact, no validation is even performed at that time to check the
data quality. When you query/retrieve your data through Hive QL, the
parameters specified along with CREATE TABLE such as ROW FORMAT, FIELDS
TERMINATED, STORED AS etc. are used to execute the right MapReduce job(s).

     In short, STORED AS refers to the type of files that a table's data
directory holds.

For details
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
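Bejoy's description (metadata-only DDL, no validation until query time) is the "schema on read" model. Here is a hypothetical Python sketch of the idea; the dict-based "schema" is purely illustrative, though \x01 (Ctrl-A) really is Hive's default field delimiter:

```python
# Schema-on-read sketch: the "table" is just metadata describing raw files;
# nothing is parsed or validated until a query actually scans the data.
RAW = b"1\x01alice\n2\x01bob\nbad-row-with-no-delimiter\n".decode()

schema = {"cols": ("id", "name"), "field_delim": "\x01"}  # stands in for the DDL

def scan(raw, schema):
    """Parse rows only at read time; short rows surface missing fields as
    None, loosely mirroring how Hive returns NULL rather than failing."""
    for line in raw.splitlines():
        fields = line.split(schema["field_delim"])
        row = dict(zip(schema["cols"], fields))
        for col in schema["cols"][len(fields):]:
            row[col] = None  # pad missing trailing columns
        yield row

for row in scan(RAW, schema):
    print(row)
```

Building the schema dict touches no data at all; only iterating scan() does, which is the external-table behaviour described above.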

Hope it helps!..

Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 11:32 PM, Tony Burton <TB...@sportingindex.com>wrote:

> Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was
> under the impression that the STORED AS part of a CREATE TABLE in Hive
> refers to how the data in the table will be stored once the table is
> created, rather than the compression format of the data used to populate
> the table. Can you clarify which is the correct interpretation? If it's the
> latter, how would I read a sequence file into a Hive table?
>
> Thanks,
>
> Tony
>
>
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:bejoy.hadoop@gmail.com]
> Sent: 09 January 2012 17:33
> To: common-user@hadoop.apache.org
> Subject: Re: has bzip2 compression been deprecated?
>
> Hi Tony
>       Adding on to Harsh's comments: if you want the generated sequence
> files to be utilized by a Hive table, define your Hive table as
>
> CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
> ...
> ...
> ....
> STORED AS SEQUENCEFILE;
>
>
> Regards
> Bejoy.K.S
>
> On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <wg...@googlemail.com> wrote:
>
> > Tony,
> >
> > snappy is also available:
> > http://code.google.com/p/hadoop-snappy/
> >
> > best,
> >  Alex
> >
> > --
> > Alexander Lorenz
> > http://mapredit.blogspot.com
> >
> > On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
> >
> > > Tony,
> > >
> > > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it
> > out (instead of a plain "fs -cat"). But if you are gonna export your
> files
> > into a system you do not have much control over, probably best to have
> the
> > resultant files not be in SequenceFile/Avro-DataFile format.
> > > * Intermediate (M-to-R) files use a custom IFile format these days,
> > which is built purely for that purpose.
> > > * Hive can use SequenceFiles very well. There is also documented info
> on
> > this in the Hive's wiki pages (Check the DDL pages, IIRC).
> > >
> > > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> > >
> > >> Thanks for the quick reply and the clarification about the
> > documentation.
> > >>
> > >> Regarding sequence files: am I right in thinking that they're a good
> > choice for intermediate steps in chained MR jobs, or for file transfer
> > between the Map and the Reduce phases of a job; but they shouldn't be
> used
> > for human-readable files at the end of one or more MapReduce jobs? How
> > about if the only use a job's output is analysis via Hive - can Hive
> create
> > tables from sequence files?
> > >>
> > >> Tony
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Harsh J [mailto:harsh@cloudera.com]
> > >> Sent: 09 January 2012 15:34
> > >> To: common-user@hadoop.apache.org
> > >> Subject: Re: has bzip2 compression been deprecated?
> > >>
> > >> Bzip2 is pretty slow. You probably do not want to use it, even if it
> > does file splits (a feature not available in the stable line of
> 0.20.x/1.x,
> > but available in 0.22+).
> > >>
> > >> To answer your question though, bzip2 was removed from that document
> > cause it isn't a native library (its pure Java). I think bzip2 was added
> > earlier due to an oversight, as even 0.20 did not have a native bzip2
> > library. This change in docs does not mean that BZip2 is deprecated -- it
> > is still fully supported and available in the trunk as well. See
> > https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
> > changes that led to this.
> > >>
> > >> The best way would be to use either:
> > >>
> > >> (a) Hadoop sequence files with any compression codec of choice (best
> > would be lzo, gz, maybe even snappy). This file format is built for HDFS
> > and MR and is splittable. Another choice would be Avro DataFiles from the
> > Apache Avro project.
> > >> (b) LZO codecs for Hadoop, via
> https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for
> packages). This requires you to run indexing
> > operations before the .lzo can be made splittable, but works great with
> > this extra step added.
> > >>
> > >> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I'm trying to work out which compression algorithm I should be using
> > in my MapReduce jobs.  It seems to me that the best solution is a
> > compromise between speed, efficiency and splittability. The only
> > compression algorithm to handle file splits (according to Hadoop: The
> > Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
> > compression speed.
> > >>>
> > >>> However, I see from the documentation at
> > http://hadoop.apache.org/common/docs/current/native_libraries.html that
> > the bzip2 library is no longer mentioned, and hasn't been since version
> > 0.20.0, see
> > http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
> > however the bzip2 Codec is still in the API at
> >
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
> > .
> > >>>
> > >>> Has bzip2 support been removed from Hadoop, or will it be removed
> soon?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Tony
> > >>>
> > >>>
> > >>>
> > >>>
> **********************************************************************
> > >>> This email and any attachments are confidential, protected by
> > copyright and may be legally privileged.  If you are not the intended
> > recipient, then the dissemination or copying of this email is prohibited.
> > If you have received this in error, please notify the sender by replying
> by
> > email and then delete the email completely from your system.  Neither
> > Sporting Index nor the sender accepts responsibility for any virus, or
> any
> > other defect which might affect any computer or IT system into which the
> > email is received and/or opened.  It is the responsibility of the
> recipient
> > to scan the email and no responsibility is accepted for any loss or
> damage
> > arising in any way from receipt or use of this email.  Sporting Index Ltd
> > is a company registered in England and Wales with company number 2636842,
> > whose registered office is at Brookfield House, Green Lane, Ivinghoe,
> > Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and
> regulated
> > by the UK Financial Services Authority (reg. no. 150404). Any financial
> > promotion contained herein has been issued
> > >>> and approved by Sporting Index Ltd.
> > >>>
> > >>> Outbound email has been scanned for viruses and SPAM

Re: has bzip2 compression been deprecated?

Posted by Bejoy Ks <be...@gmail.com>.
Hi Tony
       As I understand your requirement, your MapReduce job produces a
Sequence File as output and you need to use this file as the input to a hive
table.
        When you CREATE an EXTERNAL TABLE in hive you specify a location
where your data is stored and also what the format of that data is (the
field delimiter, row delimiter, file type, etc.). You are
actually not loading data anywhere when you create a hive external
table (issue the DDL), just specifying where the data lies in the file system; in
fact there is not even any validation performed at that time to check the
data quality. When you Query/Retrieve your data through Hive QLs, the
parameters specified along with CREATE TABLE such as ROW FORMAT, FIELDS
TERMINATED BY and STORED AS are used to execute the right MAP REDUCE job(s).

     In short, STORED AS refers to the type of files that a table's data
directory holds.

For details
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
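
For illustration, a fuller version of that DDL might look like the sketch below. The table name, columns, delimiter and HDFS path are hypothetical placeholders, not details from this thread:

```sql
-- Hypothetical sketch: point an external Hive table at the directory
-- that holds the MapReduce job's SequenceFile output. The table name,
-- columns and LOCATION below are made-up placeholders.
CREATE EXTERNAL TABLE my_job_output (
  col1 INT,
  col2 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop/output/my_job';
```

Since the table is EXTERNAL, dropping it removes only the metadata; the files under LOCATION are left in place.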

Hope it helps!..

Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 11:32 PM, Tony Burton <TB...@sportingindex.com>wrote:

> Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was
> under the impression that the STORED AS part of a CREATE TABLE in Hive
> refers to how the data in the table will be stored once the table is
> created, rather than the compression format of the data used to populate
> the table. Can you clarify which is the correct interpretation? If it's the
> latter, how would I read a sequence file into a Hive table?
>
> Thanks,
>
> Tony
>
>
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:bejoy.hadoop@gmail.com]
> Sent: 09 January 2012 17:33
> To: common-user@hadoop.apache.org
> Subject: Re: has bzip2 compression been deprecated?
>
> Hi Tony
>       Adding on to Harsh's comments. If you want the generated sequence
> files to be utilized by a hive table. Define your hive table as
>
> CREATE EXTERNAL TABLE tableNAme(col1 INT, c0l2 STRING)
> ...
> ...
> ....
> STORED AS SEQUENCEFILE;
>
>
> Regards
> Bejoy.K.S
>
> On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <wg...@googlemail.com> wrote:
>
> > Tony,
> >
> > snappy is also available:
> > http://code.google.com/p/hadoop-snappy/
> >
> > best,
> >  Alex
> >
> > --
> > Alexander Lorenz
> > http://mapredit.blogspot.com
> >
> > On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
> >
> > > Tony,
> > >
> > > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it
> > out (instead of a plain "fs -cat"). But if you are gonna export your
> files
> > into a system you do not have much control over, probably best to have
> the
> > resultant files not be in SequenceFile/Avro-DataFile format.
> > > * Intermediate (M-to-R) files use a custom IFile format these days,
> > which is built purely for that purpose.
> > > * Hive can use SequenceFiles very well. There is also documented info
> on
> > this in the Hive's wiki pages (Check the DDL pages, IIRC).
> > >
> > > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> > >
> > >> Thanks for the quick reply and the clarification about the
> > documentation.
> > >>
> > >> Regarding sequence files: am I right in thinking that they're a good
> > choice for intermediate steps in chained MR jobs, or for file transfer
> > between the Map and the Reduce phases of a job; but they shouldn't be
> used
> > for human-readable files at the end of one or more MapReduce jobs? How
> > about if the only use of a job's output is analysis via Hive - can Hive
> create
> > tables from sequence files?
> > >>
> > >> Tony
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Harsh J [mailto:harsh@cloudera.com]
> > >> Sent: 09 January 2012 15:34
> > >> To: common-user@hadoop.apache.org
> > >> Subject: Re: has bzip2 compression been deprecated?
> > >>
> > >> Bzip2 is pretty slow. You probably do not want to use it, even if it
> > does file splits (a feature not available in the stable line of
> 0.20.x/1.x,
> > but available in 0.22+).
> > >>
> > >> To answer your question though, bzip2 was removed from that document
> > cause it isn't a native library (its pure Java). I think bzip2 was added
> > earlier due to an oversight, as even 0.20 did not have a native bzip2
> > library. This change in docs does not mean that BZip2 is deprecated -- it
> > is still fully supported and available in the trunk as well. See
> > https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
> > changes that led to this.
> > >>
> > >> The best way would be to use either:
> > >>
> > >> (a) Hadoop sequence files with any compression codec of choice (best
> > would be lzo, gz, maybe even snappy). This file format is built for HDFS
> > and MR and is splittable. Another choice would be Avro DataFiles from the
> > Apache Avro project.
> > >> (b) LZO codecs for Hadoop, via
> https://github.com/toddlipcon/hadoop-lzo(and hadoop-lzo-packager for
> packages). This requires you to run indexing
> > operations before the .lzo can be made splittable, but works great with
> > this extra step added.
> > >>
> > >> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I'm trying to work out which compression algorithm I should be using
> > in my MapReduce jobs.  It seems to me that the best solution is a
> > compromise between speed, efficiency and splittability. The only
> > compression algorithm to handle file splits (according to Hadoop: The
> > Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
> > compression speed.
> > >>>
> > >>> However, I see from the documentation at
> > http://hadoop.apache.org/common/docs/current/native_libraries.html that
> > the bzip2 library is no longer mentioned, and hasn't been since version
> > 0.20.0, see
> > http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
> > however the bzip2 Codec is still in the API at
> >
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
> > .
> > >>>
> > >>> Has bzip2 support been removed from Hadoop, or will it be removed
> soon?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Tony
> > >>>
> > >>>
> > >>>
> > >>>

RE: has bzip2 compression been deprecated?

Posted by Tim Broberg <Ti...@exar.com>.
Based on this, it seems like the best approach is just to pick block compression rather than record compression, presumably for this very reason.

https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation

Perhaps record compression is the default to prioritize speed...

    - Tim.
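
For concreteness, block compression can be requested from a Hive session with settings along these lines (a sketch using the pre-YARN mapred.* property names current at the time; verify the names against your Hadoop/Hive version):

```sql
-- Sketch: emit block-compressed SequenceFile output from Hive queries.
-- Property names are the 1.x-era (mapred.*) ones.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;  -- BLOCK rather than RECORD or NONE
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```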
________________________________________
From: Tim Broberg [Tim.Broberg@exar.com]
Sent: Monday, January 09, 2012 1:42 PM
To: common-user@hadoop.apache.org; bejoy.hadoop@gmail.com
Subject: RE: has bzip2 compression been deprecated?

I thought it was optional whether Hive stored blocks (up to 1MB?) or records. If records, isn't it then storing individual records?

Am I misunderstanding?

Maybe I should get off my lazy butt and just check the source code...  ;^)

    - Tim.

________________________________________
From: bejoy.hadoop@gmail.com [bejoy.hadoop@gmail.com]
Sent: Monday, January 09, 2012 1:22 PM
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tim
       When you say that in hive a table's data is compressed using LZO or so, it means the files/blocks that contain the records/data are compressed using LZO. The size would be the same as the size of the files/blocks in hdfs. It is not as if records are stored as individual blocks in hive. Hive is just a query parser that parses SQL-like queries into MR jobs and runs them on data that lies in HDFS.
When you have larger chained jobs generated from multiple QLs you may end up with a large number of small files. There you may go in for enabling merge in hive to get sufficiently large files by merging the smaller files as the final output from your queries. This would be better for subsequent MR jobs that operate on the output as well as for optimal storage.

Hope it helps!..

Regards
Bejoy K S
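
The merge behaviour described above is controlled by Hive settings along these lines (a sketch; exact property names and defaults can vary between Hive versions):

```sql
-- Sketch: have Hive merge small files produced at the end of a job.
SET hive.merge.mapfiles=true;            -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;         -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task=256000000;  -- rough target size for merged files, in bytes
```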

-----Original Message-----
From: Tim Broberg <Ti...@exar.com>
Date: Mon, 9 Jan 2012 12:27:47
To: common-user@hadoop.apache.org<co...@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: RE: has bzip2 compression been deprecated?

Out of curiosity, when hive records are compressed, how large is a typical compressed record?

Do you have issues where the block size is too small to be compressed efficiently?

More generally, I wonder what the smallest desirable compressed record size is in the hadoop universe.

    - Tim.


The information and any attached documents contained in this message
may be confidential and/or legally privileged.  The message is
intended solely for the addressee(s).  If you are not the intended
recipient, you are hereby notified that any use, dissemination, or
reproduction is strictly prohibited and may be unlawful.  If you are
not the intended recipient, please contact the sender immediately by
return e-mail and destroy all copies of the original message.


RE: has bzip2 compression been deprecated?

Posted by Tim Broberg <Ti...@exar.com>.
I thought it was optional whether Hive stored blocks (up to 1MB?) or records. If records, isn't it then storing individual records?

Am I misunderstanding?

Maybe I should get off my lazy butt and just check the source code...  ;^)

    - Tim.

> by the UK Financial Services Authority (reg. no. 150404). Any financial
> promotion contained herein has been issued
> >>> and approved by Sporting Index Ltd.
> >>>
> >>> Outbound email has been scanned for viruses and SPAM
> >>
> >> www.sportingindex.com
> >> Inbound Email has been scanned for viruses and SPAM
> >> **********************************************************************
> >> This email and any attachments are confidential, protected by copyright
> and may be legally privileged.  If you are not the intended recipient, then
> the dissemination or copying of this email is prohibited. If you have
> received this in error, please notify the sender by replying by email and
> then delete the email completely from your system.  Neither Sporting Index
> nor the sender accepts responsibility for any virus, or any other defect
> which might affect any computer or IT system into which the email is
> received and/or opened.  It is the responsibility of the recipient to scan
> the email and no responsibility is accepted for any loss or damage arising
> in any way from receipt or use of this email.  Sporting Index Ltd is a
> company registered in England and Wales with company number 2636842, whose
> registered office is at Brookfield House, Green Lane, Ivinghoe, Leighton
> Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and regulated by the UK
> Financial Services Authority (reg. no. 150404). Any financial promotion
> contained herein has been issued
> >> and approved by Sporting Index Ltd.
> >>
> >> Outbound email has been scanned for viruses and SPAM
> >
>
>

www.sportingindex.com
Inbound Email has been scanned for viruses and SPAM
**********************************************************************
This email and any attachments are confidential, protected by copyright and may be legally privileged.  If you are not the intended recipient, then the dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system.  Neither Sporting Index nor the sender accepts responsibility for any virus, or any other defect which might affect any computer or IT system into which the email is received and/or opened.  It is the responsibility of the recipient to scan the email and no responsibility is accepted for any loss or damage arising in any way from receipt or use of this email.  Sporting Index Ltd is a company registered in England and Wales with company number 2636842, whose registered office is at Brookfield House, Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and regulated by the UK Financial Services Authority (reg. no. 150404). Any financial promotion contained herein has been issued
and approved by Sporting Index Ltd.

Outbound email has been scanned for viruses and SPAM

The information and any attached documents contained in this message
may be confidential and/or legally privileged.  The message is
intended solely for the addressee(s).  If you are not the intended
recipient, you are hereby notified that any use, dissemination, or
reproduction is strictly prohibited and may be unlawful.  If you are
not the intended recipient, please contact the sender immediately by
return e-mail and destroy all copies of the original message.

The information and any attached documents contained in this message
may be confidential and/or legally privileged.  The message is
intended solely for the addressee(s).  If you are not the intended
recipient, you are hereby notified that any use, dissemination, or
reproduction is strictly prohibited and may be unlawful.  If you are
not the intended recipient, please contact the sender immediately by
return e-mail and destroy all copies of the original message.

Re: has bzip2 compression been deprecated?

Posted by be...@gmail.com.
Hi Tim
       When you say a table's data in Hive is compressed using LZO or similar, it means the files/blocks that contain the records are compressed with LZO, and their size is simply the size of the files/blocks in HDFS. It is not the case that records are stored as individual blocks in Hive. Hive is just a query parser that translates SQL-like queries into MR jobs and runs them on data that lies in HDFS.
When you have larger chained jobs generated from multiple QLs, you may end up with a large number of small files. In that case you can enable merging in Hive to get sufficiently large files by merging the smaller files as the final output from your queries. This is better both for subsequent MR jobs that operate on the output and for storage.
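A minimal sketch of the Hive settings that control this merge behaviour (the property names are Hive's own; the size values below are illustrative placeholders, not recommendations, and defaults vary by Hive version):

```sql
-- Enable merging of small output files from map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Approximate target size, in bytes, for each merged output file.
SET hive.merge.size.per.task=256000000;
-- Trigger a merge when the average output file falls below this size.
SET hive.merge.smallfiles.avgsize=16000000;
```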

Hope it helps!..

Regards
Bejoy K S


RE: has bzip2 compression been deprecated?

Posted by Tim Broberg <Ti...@exar.com>.
Out of curiosity, when Hive records are compressed, how large is a typical compressed record?

Do you have issues where the block size is too small to be compressed efficiently?

More generally, I wonder what the smallest desirable compressed record size is in the hadoop universe.

    - Tim.


RE: has bzip2 compression been deprecated?

Posted by Tony Burton <TB...@SportingIndex.com>.
Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was under the impression that the STORED AS part of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than the compression format of the data used to populate the table. Can you clarify which is the correct interpretation? If it's the latter, how would I read a sequence file into a Hive table?

Thanks,

Tony




-----Original Message-----
From: Bejoy Ks [mailto:bejoy.hadoop@gmail.com] 
Sent: 09 January 2012 17:33
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
       Adding on to Harsh's comments: if you want the generated sequence
files to be utilized by a Hive table, define your Hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...
....
STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S
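Filled out, the DDL above might look something like this (the table name, columns, and HDFS path here are all hypothetical placeholders; the EXTERNAL/LOCATION combination is what lets Hive read sequence files already written to HDFS by a MapReduce job):

```sql
-- Hypothetical example: expose MR-produced sequence files to Hive.
-- Column names/types and the HDFS path are placeholders only.
CREATE EXTERNAL TABLE bets (
  bet_id INT,
  market STRING
)
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop/output/bets';
```

With an external table, no LOAD step is needed; Hive reads whatever sequence files sit under that directory. Note that Hive reads only the values of a sequence file as rows; the keys are ignored.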

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <wg...@googlemail.com> wrote:

> Tony,
>
> snappy is also available:
> http://code.google.com/p/hadoop-snappy/
>
> best,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>
> > Tony,
> >
> > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it
> out (instead of a plain "fs -cat"). But if you are gonna export your files
> into a system you do not have much control over, probably best to have the
> resultant files not be in SequenceFile/Avro-DataFile format.
> > * Intermediate (M-to-R) files use a custom IFile format these days,
> which is built purely for that purpose.
> > * Hive can use SequenceFiles very well. There is also documented info on
> this in the Hive's wiki pages (Check the DDL pages, IIRC).
> >
> > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> >
> >> Thanks for the quick reply and the clarification about the
> documentation.
> >>
> >> Regarding sequence files: am I right in thinking that they're a good
> choice for intermediate steps in chained MR jobs, or for file transfer
> between the Map and the Reduce phases of a job; but they shouldn't be used
> for human-readable files at the end of one or more MapReduce jobs? How
> about if the only use a job's output is analysis via Hive - can Hive create
> tables from sequence files?
> >>
> >> Tony
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 09 January 2012 15:34
> >> To: common-user@hadoop.apache.org
> >> Subject: Re: has bzip2 compression been deprecated?
> >>
> >> Bzip2 is pretty slow. You probably do not want to use it, even if it
> does file splits (a feature not available in the stable line of 0.20.x/1.x,
> but available in 0.22+).
> >>
> >> To answer your question though, bzip2 was removed from that document
> cause it isn't a native library (its pure Java). I think bzip2 was added
> earlier due to an oversight, as even 0.20 did not have a native bzip2
> library. This change in docs does not mean that BZip2 is deprecated -- it
> is still fully supported and available in the trunk as well. See
> https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
> changes that led to this.
> >>
> >> The best way would be to use either:
> >>
> >> (a) Hadoop sequence files with any compression codec of choice (best
> would be lzo, gz, maybe even snappy). This file format is built for HDFS
> and MR and is splittable. Another choice would be Avro DataFiles from the
> Apache Avro project.
> >> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo
> (and hadoop-lzo-packager for packages). This requires you to run indexing
> operations before the .lzo can be made splittable, but works great with
> this extra step added.
> >>
> >> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm trying to work out which compression algorithm I should be using
> in my MapReduce jobs.  It seems to me that the best solution is a
> compromise between speed, efficiency and splittability. The only
> compression algorithm to handle file splits (according to Hadoop: The
> Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
> compression speed.
> >>>
> >>> However, I see from the documentation at
> http://hadoop.apache.org/common/docs/current/native_libraries.html that
> the bzip2 library is no longer mentioned, and hasn't been since version
> 0.20.0, see
> http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
> however the bzip2 Codec is still in the API at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
> .
> >>>
> >>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>> **********************************************************************
> >>> This email and any attachments are confidential, protected by
> copyright and may be legally privileged.  If you are not the intended
> recipient, then the dissemination or copying of this email is prohibited.
> If you have received this in error, please notify the sender by replying by
> email and then delete the email completely from your system.  Neither
> Sporting Index nor the sender accepts responsibility for any virus, or any
> other defect which might affect any computer or IT system into which the
> email is received and/or opened.  It is the responsibility of the recipient
> to scan the email and no responsibility is accepted for any loss or damage
> arising in any way from receipt or use of this email.  Sporting Index Ltd
> is a company registered in England and Wales with company number 2636842,
> whose registered office is at Brookfield House, Green Lane, Ivinghoe,
> Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and regulated
> by the UK Financial Services Authority (reg. no. 150404). Any financial
> promotion contained herein has been issued
> >>> and approved by Sporting Index Ltd.
> >>>
> >>> Outbound email has been scanned for viruses and SPAM
> >>
> >
>
>


Re: has bzip2 compression been deprecated?

Posted by Bejoy Ks <be...@gmail.com>.
Hi Tony
       Adding on to Harsh's comments. If you want the generated sequence
files to be utilized by a hive table. Define your hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...
...
STORED AS SEQUENCEFILE;


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <wg...@googlemail.com> wrote:

> Tony,
>
> snappy is also available:
> http://code.google.com/p/hadoop-snappy/
>
> best,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>
> > Tony,
> >
> > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it
> out (instead of a plain "fs -cat"). But if you are gonna export your files
> into a system you do not have much control over, probably best to have the
> resultant files not be in SequenceFile/Avro-DataFile format.
> > * Intermediate (M-to-R) files use a custom IFile format these days,
> which is built purely for that purpose.
> > * Hive can use SequenceFiles very well. There is also documented info on
> this in the Hive's wiki pages (Check the DDL pages, IIRC).
> >
> > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> >
> >> Thanks for the quick reply and the clarification about the
> documentation.
> >>
> >> Regarding sequence files: am I right in thinking that they're a good
> choice for intermediate steps in chained MR jobs, or for file transfer
> between the Map and the Reduce phases of a job; but they shouldn't be used
> for human-readable files at the end of one or more MapReduce jobs? How
> about if the only use a job's output is analysis via Hive - can Hive create
> tables from sequence files?
> >>
> >> Tony
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 09 January 2012 15:34
> >> To: common-user@hadoop.apache.org
> >> Subject: Re: has bzip2 compression been deprecated?
> >>
> >> Bzip2 is pretty slow. You probably do not want to use it, even if it
> does file splits (a feature not available in the stable line of 0.20.x/1.x,
> but available in 0.22+).
> >>
> >> To answer your question though, bzip2 was removed from that document
> cause it isn't a native library (its pure Java). I think bzip2 was added
> earlier due to an oversight, as even 0.20 did not have a native bzip2
> library. This change in docs does not mean that BZip2 is deprecated -- it
> is still fully supported and available in the trunk as well. See
> https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
> changes that led to this.
> >>
> >> The best way would be to use either:
> >>
> >> (a) Hadoop sequence files with any compression codec of choice (best
> would be lzo, gz, maybe even snappy). This file format is built for HDFS
> and MR and is splittable. Another choice would be Avro DataFiles from the
> Apache Avro project.
> >> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing
> operations before the .lzo can be made splittable, but works great with
> this extra step added.
> >>
> >> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm trying to work out which compression algorithm I should be using
> in my MapReduce jobs.  It seems to me that the best solution is a
> compromise between speed, efficiency and splittability. The only
> compression algorithm to handle file splits (according to Hadoop: The
> Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
> compression speed.
> >>>
> >>> However, I see from the documentation at
> http://hadoop.apache.org/common/docs/current/native_libraries.html that
> the bzip2 library is no longer mentioned, and hasn't been since version
> 0.20.0, see
> http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
> however the bzip2 Codec is still in the API at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
> .
> >>>
> >>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>
> >
>
>

Re: has bzip2 compression been deprecated?

Posted by "alo.alt" <wg...@googlemail.com>.
Tony,

snappy is also available:
http://code.google.com/p/hadoop-snappy/

best,
 Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 9, 2012, at 8:49 AM, Harsh J wrote:

> Tony,
> 
> * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it out (instead of a plain "fs -cat"). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format.
> * Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose.
> * Hive can use SequenceFiles very well. There is also documented info on this in the Hive's wiki pages (Check the DDL pages, IIRC).
> 
> On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> 
>> Thanks for the quick reply and the clarification about the documentation.
>> 
>> Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use a job's output is analysis via Hive - can Hive create tables from sequence files? 
>> 
>> Tony
>> 
>> 
>> 
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com] 
>> Sent: 09 January 2012 15:34
>> To: common-user@hadoop.apache.org
>> Subject: Re: has bzip2 compression been deprecated?
>> 
>> Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+).
>> 
>> To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.
>> 
>> The best way would be to use either:
>> 
>> (a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
>> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added.
>> 
>> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
>> 
>>> Hi,
>>> 
>>> I'm trying to work out which compression algorithm I should be using in my MapReduce jobs.  It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.
>>> 
>>> However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
>>> 
>>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
>>> 
>>> Thanks,
>>> 
>>> Tony
>>> 
>>> 
>>> 
>> 
> 


Re: has bzip2 compression been deprecated?

Posted by Harsh J <ha...@cloudera.com>.
Tony,

* Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it out (instead of a plain "fs -cat"). But if you are gonna export your files into a system you do not have much control over, probably best to have the resultant files not be in SequenceFile/Avro-DataFile format.
* Intermediate (M-to-R) files use a custom IFile format these days, which is built purely for that purpose.
* Hive can use SequenceFiles very well. There is also documented info on this in Hive's wiki pages (check the DDL pages, IIRC).

On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:

> Thanks for the quick reply and the clarification about the documentation.
> 
> Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use a job's output is analysis via Hive - can Hive create tables from sequence files? 
> 
> Tony
> 
> 
> 
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com] 
> Sent: 09 January 2012 15:34
> To: common-user@hadoop.apache.org
> Subject: Re: has bzip2 compression been deprecated?
> 
> Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+).
> 
> To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.
> 
> The best way would be to use either:
> 
> (a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added.
> 
> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> 
>> Hi,
>> 
>> I'm trying to work out which compression algorithm I should be using in my MapReduce jobs.  It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.
>> 
>> However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
>> 
>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
>> 
>> Thanks,
>> 
>> Tony
>> 
>> 
>> 
> 


RE: has bzip2 compression been deprecated?

Posted by Tony Burton <TB...@SportingIndex.com>.
Thanks for the quick reply and the clarification about the documentation.

Regarding sequence files: am I right in thinking that they're a good choice for intermediate steps in chained MR jobs, or for file transfer between the Map and the Reduce phases of a job; but they shouldn't be used for human-readable files at the end of one or more MapReduce jobs? How about if the only use of a job's output is analysis via Hive - can Hive create tables from sequence files? 

Tony



-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: 09 January 2012 15:34
To: common-user@hadoop.apache.org
Subject: Re: has bzip2 compression been deprecated?

Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+).

To answer your question though, bzip2 was removed from that document cause it isn't a native library (its pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.

The best way would be to use either:

(a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added.

On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:

> Hi,
> 
> I'm trying to work out which compression algorithm I should be using in my MapReduce jobs.  It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.
> 
> However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
> 
> Has bzip2 support been removed from Hadoop, or will it be removed soon?
> 
> Thanks,
> 
> Tony
> 
> 
> 


Re: has bzip2 compression been deprecated?

Posted by Harsh J <ha...@cloudera.com>.
Bzip2 is pretty slow. You probably do not want to use it, even if it does file splits (a feature not available in the stable line of 0.20.x/1.x, but available in 0.22+).

To answer your question though, bzip2 was removed from that document because it isn't a native library (it's pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20 did not have a native bzip2 library. This change in docs does not mean that BZip2 is deprecated -- it is still fully supported and available in the trunk as well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.

The best way would be to use either:

(a) Hadoop sequence files with any compression codec of choice (best would be lzo, gz, maybe even snappy). This file format is built for HDFS and MR and is splittable. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager for packages). This requires you to run indexing operations before the .lzo can be made splittable, but works great with this extra step added.
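As a concrete illustration of option (a), the job-side settings for block-compressed SequenceFile output can be written as old-API (0.20-era "mapred") configuration properties. This sketch is not from the thread; gzip is just one of the codec choices listed above, and the job's output format would additionally be set to SequenceFileOutputFormat in the driver code:

```xml
<!-- Illustrative job configuration for block-compressed SequenceFile
     output (old "mapred" property names). BLOCK compression compresses
     runs of records together, which keeps the file splittable even when
     the inner codec (gzip here) is not splittable on its own. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```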

On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:

> Hi,
> 
> I'm trying to work out which compression algorithm I should be using in my MapReduce jobs.  It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.
> 
> However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
> 
> Has bzip2 support been removed from Hadoop, or will it be removed soon?
> 
> Thanks,
> 
> Tony
> 
> 
> 


has bzip2 compression been deprecated?

Posted by Tony Burton <TB...@SportingIndex.com>.
Hi,

I'm trying to work out which compression algorithm I should be using in my MapReduce jobs.  It seems to me that the best solution is a compromise between speed, efficiency and splittability. The only compression algorithm to handle file splits (according to Hadoop: The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.

However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html - however the bzip2 Codec is still in the API at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.

Has bzip2 support been removed from Hadoop, or will it be removed soon?

Thanks,

Tony




Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.
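For a node profile like the one described below (2 * 12 cores, 32G RAM), one common heuristic is to leave a core or two for the DataNode/TaskTracker daemons and split the remaining slots roughly 4:3 between maps and reduces. A purely illustrative starting point for mapred-site.xml on each TaskTracker, to be tuned against your actual workload, might be:

```xml
<!-- mapred-site.xml on each TaskTracker node. Values are illustrative
     for a 24-core box: ~22 usable slots split roughly 4:3 maps:reduces. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>9</value>
</property>
```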

On 09-Jan-2012, at 5:50 PM, hao.wang wrote:

> Hi, all
>    Our hadoop cluster has 22 nodes including one namenode, one jobtracker and 20 datanodes.
>    Each node has 2 * 12 cores with 32G RAM
>    Does anyone tell me how to configure the following parameters:
>    mapred.tasktracker.map.tasks.maximum
>    mapred.tasktracker.reduce.tasks.maximum
> 
> regards!
> 2012-01-09 
> 
> 
> 
> hao.wang