Posted to dev@hive.apache.org by Gourav Sengupta <go...@gmail.com> on 2013/10/09 18:58:42 UTC

Single Mapper - HIVE 0.11

Hi,

I am trying to run a join using two tables stored in ORC file format.

The first table has 34 million records and the second has around 300,000
records.

Setting "set hive.auto.convert.join=true" makes the entire query run via a
single mapper.
If I set "set hive.auto.convert.join=false", then there are two
mappers: the first one reads the second table, and then the entire large table
goes through the second mapper.

Is there something that I am doing wrong? There are three nodes in
the Hadoop cluster currently, and I was expecting that at least 6 mappers
would be used.

Thanks and Regards,
Gourav

Re: Single Mapper - HIVE 0.11

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
The smaller the stripe size, the smaller the HDFS block size will be, and the more mappers there will be. By default, ORC chooses an HDFS block size of 2 times the stripe size.
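As a rough back-of-the-envelope sketch of that relationship (assuming one map task per HDFS block and nothing else limiting parallelism; this is an illustration, not a guarantee):

```python
def estimated_mappers(file_bytes, stripe_bytes):
    """Rough mapper count for a single ORC file, assuming ORC's default
    of an HDFS block size equal to 2x the stripe size, and one map task
    per HDFS block."""
    block_bytes = 2 * stripe_bytes
    return max(1, -(-file_bytes // block_bytes))  # ceiling division

# e.g. a 7 GB file written with 256 MB stripes -> 512 MB blocks -> 14 mappers
print(estimated_mappers(7 * 2**30, 256 * 2**20))  # -> 14
```

So shrinking the stripe size when writing the table should, all else equal, raise this estimate.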

Thanks
Prasanth Jayachandran

On Oct 11, 2013, at 1:42 AM, Gourav Sengupta <go...@gmail.com> wrote:

> Hi,
> 
> I regenerated the entire base data, and added the following configuration
> changes to hive-site.xml and mapred-site.xml and now there are multiple
> mappers and reducers running for the same query. I am still not quite sure
> how to go about using ORC file stripe size for increasing the number of
> mappers.
> 
> Is there any other performance optimization that I could have done? Please
> do advice.
> 
> =======================================
> mapred-site.xml and hive-site.xml changes:
> ---------------------------------------------------------------
> <property>
>  <name>mapred.map.child.java.opts</name>
>  <value>-Xmx1024M</value>
> </property>
> <property>
>  <name>mapred.reduce.child.java.opts</name>
>  <value>-Xmx1024M</value>
> </property>
> <property>
>  <name>mapred.child.java.opts</name>
>  <value>-Xmx1024M</value>
>  <description>setting memory for child jobs</description>
> </property>
> =======================================
> 
> =======================================
> hive run-time configuration:
> ---------------------------------------
> "set mapred.reduce.tasks=4"
> "set hive.auto.convert.join=true"
> =======================================
> 
> 
> Thanks and Regards,
> Gourav Sengupta
> 
> 
> 
> On Thu, Oct 10, 2013 at 9:16 AM, Gourav Sengupta <go...@gmail.com>wrote:
> 
>> Hi,
>> 
>> The entire table of 34 million records is in a single ORC file. and its
>> around 7 GB in size. the other ORC file is a dimension table with less than
>> 40 MB of records once again in a single ORC file.
>> 
>> I do not remember setting anywhere ORC file stripe size.
>> 
>> The problem that I am facing is the query is triggering only a single
>> mapper though the cluster has three nodes. Unlike other posts here I need
>> more mappers.
>> 
>> The other mentioned properties are mentioned below from the job xml file:
>> 
>> <property><name>mapred.min.split.size.per.node</name><value>1</value></property>
>> and
>> 
>> <property><name>mapred.max.split.size</name><value>256000000</value></property>
>> 
>> I am sure that there is no issue with HADOOP configuration as with some
>> other queries I am getting more than 24 mappers.
>> 
>> Please accept my sincere regards for your kind help and insights.
>> 
>> 
>> Thanks,
>> Gourav Sengupta
>> 
>> 
>> 
>> On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <
>> pjayachandran@hortonworks.com> wrote:
>> 
>>> What is your ORC file stripe size? How many ORC files are there in each
>>> of the tables? It could be possible that ORC compressed the file so much
>>> that the file size is less than the HDFS block size. Can you please report
>>> the file size of the two ORC files?
>>> 
>>> Another possibility is that there are many small files. In that case by
>>> default hive uses CombineHiveInputFormat which combines many small files
>>> into a single large file. Hence you will see less number of mappers. If you
>>> are expecting one mapper per hdfs file, then try disabling
>>> CombineHiveInputFormat by "set
>>> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;". Another
>>> way to control the number of mappers is by adjusting the min and max split
>>> size.
>>> 
>>> Thanks
>>> Prasanth Jayachandran
>>> 
>>> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <ni...@gmail.com> wrote:
>>> 
>>>> whats the size of the table? (in GBs? )
>>>> 
>>>> Whats the max and min split sizes have you provied?
>>>> 
>>>> 
>>>> On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <
>>> gourav.hadoop@gmail.com>wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am trying to run a join using two tables stored in ORC file format.
>>>>> 
>>>>> The first table has 34 million records and the second has around
>>> 300,000
>>>>> records.
>>>>> 
>>>>> Setting "set hive.auto.convert.join=true" makes the entire query run
>>> via a
>>>>> single mapper.
>>>>> In case I am setting "set hive.auto.convert.join=false" then there are
>>> two
>>>>> mappers first one reads the second table and then the entire large
>>> table
>>>>> goes through the second mapper.
>>>>> 
>>>>> Is there something that I am doing wrong because there are three nodes
>>> in
>>>>> the HADOOP cluster currently and I was expecting that at least 6
>>> mappers
>>>>> should have been used.
>>>>> 
>>>>> Thanks and Regards,
>>>>> Gourav
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Nitin Pawar
>>> 
>>> 
>>> --
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity
>>> to
>>> which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the reader
>>> of this message is not the intended recipient, you are hereby notified
>>> that
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender
>>> immediately
>>> and delete it from your system. Thank You.
>>> 
>> 
>> 



Re: Single Mapper - HIVE 0.11

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I regenerated the entire base data and added the following configuration
changes to hive-site.xml and mapred-site.xml, and now there are multiple
mappers and reducers running for the same query. I am still not quite sure
how to go about using the ORC file stripe size to increase the number of
mappers.

Is there any other performance optimization that I could have done? Please
do advise.

=======================================
mapred-site.xml and hive-site.xml changes:
---------------------------------------------------------------
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024M</value>
  <description>setting memory for child jobs</description>
</property>
=======================================

=======================================
hive run-time configuration:
---------------------------------------
"set mapred.reduce.tasks=4"
"set hive.auto.convert.join=true"
=======================================


Thanks and Regards,
Gourav Sengupta



On Thu, Oct 10, 2013 at 9:16 AM, Gourav Sengupta <go...@gmail.com>wrote:

> Hi,
>
> The entire table of 34 million records is in a single ORC file. and its
> around 7 GB in size. the other ORC file is a dimension table with less than
> 40 MB of records once again in a single ORC file.
>
> I do not remember setting anywhere ORC file stripe size.
>
> The problem that I am facing is the query is triggering only a single
> mapper though the cluster has three nodes. Unlike other posts here I need
> more mappers.
>
> The other mentioned properties are mentioned below from the job xml file:
>
> <property><name>mapred.min.split.size.per.node</name><value>1</value></property>
> and
>
> <property><name>mapred.max.split.size</name><value>256000000</value></property>
>
> I am sure that there is no issue with HADOOP configuration as with some
> other queries I am getting more than 24 mappers.
>
> Please accept my sincere regards for your kind help and insights.
>
>
> Thanks,
> Gourav Sengupta
>
>
>
> On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <
> pjayachandran@hortonworks.com> wrote:
>
>> What is your ORC file stripe size? How many ORC files are there in each
>> of the tables? It could be possible that ORC compressed the file so much
>> that the file size is less than the HDFS block size. Can you please report
>> the file size of the two ORC files?
>>
>> Another possibility is that there are many small files. In that case by
>> default hive uses CombineHiveInputFormat which combines many small files
>> into a single large file. Hence you will see less number of mappers. If you
>> are expecting one mapper per hdfs file, then try disabling
>> CombineHiveInputFormat by "set
>> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;". Another
>> way to control the number of mappers is by adjusting the min and max split
>> size.
>>
>> Thanks
>> Prasanth Jayachandran
>>
>> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <ni...@gmail.com> wrote:
>>
>> > whats the size of the table? (in GBs? )
>> >
>> > Whats the max and min split sizes have you provied?
>> >
>> >
>> > On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <
>> gourav.hadoop@gmail.com>wrote:
>> >
>> >> Hi,
>> >>
>> >> I am trying to run a join using two tables stored in ORC file format.
>> >>
>> >> The first table has 34 million records and the second has around
>> 300,000
>> >> records.
>> >>
>> >> Setting "set hive.auto.convert.join=true" makes the entire query run
>> via a
>> >> single mapper.
>> >> In case I am setting "set hive.auto.convert.join=false" then there are
>> two
>> >> mappers first one reads the second table and then the entire large
>> table
>> >> goes through the second mapper.
>> >>
>> >> Is there something that I am doing wrong because there are three nodes
>> in
>> >> the HADOOP cluster currently and I was expecting that at least 6
>> mappers
>> >> should have been used.
>> >>
>> >> Thanks and Regards,
>> >> Gourav
>> >>
>> >
>> >
>> >
>> > --
>> > Nitin Pawar
>>
>>
>>
>
>

Re: Single Mapper - HIVE 0.11

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

The entire table of 34 million records is in a single ORC file, and it is
around 7 GB in size. The other ORC file is a dimension table with less than
40 MB of records, once again in a single ORC file.

I do not remember setting the ORC file stripe size anywhere.

The problem that I am facing is that the query triggers only a single
mapper even though the cluster has three nodes. Unlike other posts here, I need
more mappers.

The relevant properties from the job XML file are:
<property><name>mapred.min.split.size.per.node</name><value>1</value></property>
and
<property><name>mapred.max.split.size</name><value>256000000</value></property>
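For illustration (assuming splits were purely size-based, which is not guaranteed for a single ORC file, where stripe and block boundaries also matter), mapred.max.split.size alone would suggest many more splits than one:

```python
# Hypothetical split count if the ~7 GB fact table were divisible purely
# by mapred.max.split.size; real ORC splitting also depends on stripe
# and HDFS block boundaries, so this is only an upper-bound sketch.
file_bytes = 7 * 2**30        # ~7 GB fact table from above
max_split = 256_000_000       # mapred.max.split.size from the job XML
print(-(-file_bytes // max_split))  # ceiling division -> 30
```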

I am sure that there is no issue with the Hadoop configuration, as with some
other queries I am getting more than 24 mappers.

Please accept my sincere regards for your kind help and insights.


Thanks,
Gourav Sengupta



On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:

> What is your ORC file stripe size? How many ORC files are there in each of
> the tables? It could be possible that ORC compressed the file so much that
> the file size is less than the HDFS block size. Can you please report the
> file size of the two ORC files?
>
> Another possibility is that there are many small files. In that case by
> default hive uses CombineHiveInputFormat which combines many small files
> into a single large file. Hence you will see less number of mappers. If you
> are expecting one mapper per hdfs file, then try disabling
> CombineHiveInputFormat by "set
> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;". Another
> way to control the number of mappers is by adjusting the min and max split
> size.
>
> Thanks
> Prasanth Jayachandran
>
> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <ni...@gmail.com> wrote:
>
> > whats the size of the table? (in GBs? )
> >
> > Whats the max and min split sizes have you provied?
> >
> >
> > On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <
> gourav.hadoop@gmail.com>wrote:
> >
> >> Hi,
> >>
> >> I am trying to run a join using two tables stored in ORC file format.
> >>
> >> The first table has 34 million records and the second has around 300,000
> >> records.
> >>
> >> Setting "set hive.auto.convert.join=true" makes the entire query run
> via a
> >> single mapper.
> >> In case I am setting "set hive.auto.convert.join=false" then there are
> two
> >> mappers first one reads the second table and then the entire large table
> >> goes through the second mapper.
> >>
> >> Is there something that I am doing wrong because there are three nodes
> in
> >> the HADOOP cluster currently and I was expecting that at least 6 mappers
> >> should have been used.
> >>
> >> Thanks and Regards,
> >> Gourav
> >>
> >
> >
> >
> > --
> > Nitin Pawar
>
>
>

Re: Single Mapper - HIVE 0.11

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
What is your ORC file stripe size? How many ORC files are there in each of the tables? It is possible that ORC compressed the file so much that the file size is less than the HDFS block size. Can you please report the file sizes of the two ORC files?

Another possibility is that there are many small files. In that case, by default Hive uses CombineHiveInputFormat, which combines many small files into a single large split. Hence you will see fewer mappers. If you are expecting one mapper per HDFS file, then try disabling CombineHiveInputFormat with "set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;". Another way to control the number of mappers is to adjust the min and max split sizes.
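For example, a session-level sketch combining those two suggestions (the property names are the ones quoted elsewhere in this thread; the 64 MB value is only illustrative, not a recommendation):

```sql
-- Disable split combining so each HDFS split gets its own mapper
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-- Shrink the max split size (in bytes) to produce more, smaller splits
set mapred.max.split.size=67108864;       -- 64 MB, illustrative
set mapred.min.split.size.per.node=1;
```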

Thanks
Prasanth Jayachandran

On Oct 9, 2013, at 10:03 AM, Nitin Pawar <ni...@gmail.com> wrote:

> whats the size of the table? (in GBs? )
> 
> Whats the max and min split sizes have you provied?
> 
> 
> On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <go...@gmail.com>wrote:
> 
>> Hi,
>> 
>> I am trying to run a join using two tables stored in ORC file format.
>> 
>> The first table has 34 million records and the second has around 300,000
>> records.
>> 
>> Setting "set hive.auto.convert.join=true" makes the entire query run via a
>> single mapper.
>> In case I am setting "set hive.auto.convert.join=false" then there are two
>> mappers first one reads the second table and then the entire large table
>> goes through the second mapper.
>> 
>> Is there something that I am doing wrong because there are three nodes in
>> the HADOOP cluster currently and I was expecting that at least 6 mappers
>> should have been used.
>> 
>> Thanks and Regards,
>> Gourav
>> 
> 
> 
> 
> -- 
> Nitin Pawar



Re: Single Mapper - HIVE 0.11

Posted by Nitin Pawar <ni...@gmail.com>.
What is the size of the table (in GBs)?

What are the max and min split sizes you have provided?


On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <go...@gmail.com>wrote:

> Hi,
>
> I am trying to run a join using two tables stored in ORC file format.
>
> The first table has 34 million records and the second has around 300,000
> records.
>
> Setting "set hive.auto.convert.join=true" makes the entire query run via a
> single mapper.
> In case I am setting "set hive.auto.convert.join=false" then there are two
> mappers first one reads the second table and then the entire large table
> goes through the second mapper.
>
> Is there something that I am doing wrong because there are three nodes in
> the HADOOP cluster currently and I was expecting that at least 6 mappers
> should have been used.
>
> Thanks and Regards,
> Gourav
>



-- 
Nitin Pawar