Posted to user@hive.apache.org by Brad Heintz <br...@gmail.com> on 2009/09/11 23:06:38 UTC

Strange behavior during Hive queries

TIA if anyone can point me in the right direction on this.

I'm running a simple Hive query (a count on an external table comprising 436
files, each of ~2GB).  The cluster's mapred-site.xml specifies
mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
mappers spawned on each worker.

The problem:  When I run my Hive query, I see 2 mappers spawned per worker.

When I do "set -v;" from the Hive command line, I see
mapred.tasktracker.map.tasks.maximum = 7.

The job.xml for the Hive query shows mapred.tasktracker.map.tasks.maximum =
7.

The only lead I have is that the default for
mapred.tasktracker.map.tasks.maximum is 2. Even though it's overridden in the
cluster's mapred-site.xml, I've tried redundantly overriding this variable
everywhere I can think of (on the Hive command line with "-hiveconf", using
set from the Hive prompt, and so on) and nothing works.  I've combed the
docs & mailing list, but haven't run across the answer.

Does anyone have any ideas what (if anything) I'm missing?  Is this some
quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
and I should just leave it alone?  Or is there some knob I can fiddle to get
it to use my cluster at full power?
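
For readers of the archive: the replies below converge on the point that the
total number of map tasks comes from input splits, not from the tasktracker
slot limit. A rough sketch of the old-API FileInputFormat split arithmetic
(an illustration only, not Hive's or Hadoop's actual code; the formula and
parameter names are assumptions based on the configuration keys discussed in
this thread):

```python
# Sketch of how old-API FileInputFormat sizes splits. Assumed formula:
#   split_size = max(min_split_size, min(goal_size, block_size)),
# where goal_size = total_input / requested_maps.
# The per-tracker slot cap (mapred.tasktracker.map.tasks.maximum) never
# appears in this math; it only limits how many of the resulting tasks
# run at once on a single node.

def num_map_tasks(file_sizes, block_size, min_split_size=1, requested_maps=1):
    total = sum(file_sizes)
    goal_size = max(1, total // max(1, requested_maps))
    split_size = max(1, min_split_size, min(goal_size, block_size))
    # Splits never span files, so each file is cut up independently
    # (ceil-divide each file's size by the split size).
    return sum(max(1, -(-size // split_size)) for size in file_sizes)

GB, MB = 1 << 30, 1 << 20
# 436 files of ~2GB on default 64MB blocks: one split (and one mapper) per block.
print(num_map_tasks([2 * GB] * 436, block_size=64 * MB))  # 13952
```

If this model is right, the slot limit explains a 14-at-a-time ceiling (2
slots x 7 nodes) but says nothing about the job's total task count, which is
the part the thread goes on to probe.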

Many thanks in advance,
- Brad

-- 
Brad Heintz
brad.heintz@gmail.com

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
No - 2 mappers per node, 7 nodes = 14 mappers total.  Most jobs use 7 per
node (49 total).
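
To spell out the arithmetic in this exchange (a trivial sketch of the
reported numbers, nothing Hive-specific):

```python
# Concurrent mappers = map slots per tasktracker x worker nodes.
# The figures below are the ones reported in this thread.
def concurrent_mappers(slots_per_node, nodes):
    return slots_per_node * nodes

print(concurrent_mappers(2, 7))  # 14, what the Hive query achieves
print(concurrent_mappers(7, 7))  # 49, what plain MapReduce jobs achieve
```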

On Thu, Sep 17, 2009 at 12:56 AM, Zheng Shao <zs...@gmail.com> wrote:

> You mean 14 mappers running concurrently, correct?
> How many mappers in total for the hive query?
>
> Zheng
>
>
> On Wed, Sep 16, 2009 at 6:50 AM, Brad Heintz <br...@gmail.com>wrote:
>
>> There are 14 mappers spawned when I do a Hive query - over 7 nodes.  Other
>> jobs spawn 7 mappers per node (total of 49), rather than 2.
>>
>> Block size is default.
>>
>> I'll try the "describe extended" as soon as I get a chance.
>>
>> Thanks,
>> - Brad
>>
>>
>> On Tue, Sep 15, 2009 at 7:23 PM, Ashish Thusoo <at...@facebook.com>wrote:
>>
>>>  Can't seem to make head or tail of this. How many mappers does the job
>>> spawn? The explain plan seems to be fine. Can you also do a
>>>
>>> describe extended
>>>
>>> on both the input and the output table.
>>>
>>> Also what is the block size and how many hdfs nodes is this data spread
>>> over.
>>>
>>> Ashish
>>>  ------------------------------
>>> *From:* Brad Heintz [mailto:brad.heintz@gmail.com]
>>> *Sent:* Monday, September 14, 2009 1:23 PM
>>>
>>> *To:* hive-user@hadoop.apache.org
>>> *Subject:* Re: Strange behavior during Hive queries
>>>
>>> 436 files, each about 2GB.
>>>
>>>
>>> On Mon, Sep 14, 2009 at 4:02 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>>>  Currently, Hive uses 1 mapper per file – does your table have lots of
>>>> small files? If yes, it might be a good idea to concatenate them into fewer
>>>> files.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From:* Ravi Jagannathan [mailto:Ravi.Jagannathan@nominum.com]
>>>> *Sent:* Monday, September 14, 2009 12:17 PM
>>>> *To:* Brad Heintz; hive-user@hadoop.apache.org
>>>> *Subject:* RE: Strange behavior during Hive queries
>>>>
>>>>
>>>>
>>>>
>>>> http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
>>>>
>>>> Related issue: Hive used too many mappers for a very small table.
>>>>
>>>>
>>>>  ------------------------------
>>>>
>>>> *From:* Brad Heintz [mailto:brad.heintz@gmail.com]
>>>> *Sent:* Monday, September 14, 2009 11:51 AM
>>>> *To:* hive-user@hadoop.apache.org
>>>> *Subject:* Re: Strange behavior during Hive queries
>>>>
>>>>
>>>>
>>>> Ashish -
>>>>
>>>> mapred.min.split.size is set to 0 (according to the job.xml).  The data
>>>> are stored as uncompressed text files.
>>>>
>>>> Plan is attached.  I've been over it and didn't find anything useful,
>>>> but I'm also new to Hive and don't claim to understand everything I'm
>>>> looking at.  If you have any insight, I'd be most grateful.
>>>>
>>>> Many thanks,
>>>> - Brad
>>>>
>>>> On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <at...@facebook.com>
>>>> wrote:
>>>>
>>>> How is your data stored - sequencefiles, textfiles, compressed? And
>>>> what is the value of mapred.min.split.size? Hive does not usually make a
>>>> decision on the number of mappers but it does try to make an estimate of the
>>>> number of reducers to use. Also if you send out the plan that would be
>>>> great.
>>>>
>>>>
>>>>
>>>> Ashish
>>>>
>>>>
>>>>  ------------------------------
>>>>
>>>> *From:* Brad Heintz [mailto:brad.heintz@gmail.com]
>>>> *Sent:* Sunday, September 13, 2009 9:36 AM
>>>> *To:* hive-user@hadoop.apache.org
>>>> *Subject:* Re: Strange behavior during Hive queries
>>>>
>>>> Edward -
>>>>
>>>> Yeah, I figured Hive had some decisions it made internally about how
>>>> many mappers & reducers it used, but this is acting on almost 1TB of data -
>>>> I don't see why it would use fewer mappers.  Also, this isn't a sort (which
>>>> would of course use only 1 reducer) - it's a straight count.
>>>>
>>>> Thanks,
>>>> - Brad
>>>>
>>>> On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <ed...@gmail.com>
>>>> wrote:
>>>>
>>>> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>> > Hrm... sorry, I didn't read your original query closely enough.
>>>> >
>>>> > I'm not sure what could be causing this. The map.tasks.maximum
>>>> parameter
>>>> > shouldn't affect it at all - it only affects the number of slots on
>>>> the
>>>> > trackers.
>>>> >
>>>> > By any chance do you have mapred.max.maps.per.node set? This is a
>>>> > configuration parameter added by HADOOP-5170 - it's not in trunk or
>>>> the
>>>> > vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3
>>>> release this
>>>> > parameter could cause the behavior you're seeing. However, it would
>>>> > certainly not default to 2, so I'd be surprised if that were it.
>>>> >
>>>> > -Todd
>>>> >
>>>> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Todd -
>>>> >>
>>>> >> Of course; it makes sense that it would be that way.  But I'm still
>>>> left
>>>> >> wondering why, then, my Hive queries are only using 2 mappers per
>>>> task
>>>> >> tracker when other jobs use 7.  I've gone so far as to diff the
>>>> job.xml
>>>> >> files from a regular job and a Hive query, and didn't turn up
>>>> anything -
>>>> >> though clearly, it has to be something Hive is doing.
>>>> >>
>>>> >> Thanks,
>>>> >> - Brad
>>>> >>
>>>> >>
>>>> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com>
>>>> wrote:
>>>> >>>
>>>> >>> Hi Brad,
>>>> >>>
>>>> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>>> >>> TaskTracker when it starts up. It cannot be changed per-job.
>>>> >>>
>>>> >>> Hope that helps
>>>> >>> -Todd
>>>> >>>
>>>> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <brad.heintz@gmail.com
>>>> >
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> TIA if anyone can point me in the right direction on this.
>>>> >>>>
>>>> >>>> I'm running a simple Hive query (a count on an external table
>>>> comprising
>>>> >>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per
>>>> worker
>>>> >>>> node.  When I run regular MR jobs via "bin/hadoop jar
>>>> myJob.jar...", I see 7
>>>> >>>> mappers spawned on each worker.
>>>> >>>>
>>>> >>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> >>>> worker.
>>>> >>>>
>>>> >>>> When I do "set -v;" from the Hive command line, I see
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>> >>>>
>>>> >>>> The job.xml for the Hive query shows
>>>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>> >>>>
>>>> >>>> The only lead I have is that the default for
>>>> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
>>>> overridden
>>>> >>>> in the cluster's mapred-site.xml I've tried redundantly overriding
>>>> this
>>>> >>>> variable everyplace I can think of (Hive command line with
>>>> "-hiveconf",
>>>> >>>> using set from the Hive prompt, et al) and nothing works.  I've
>>>> combed the
>>>> >>>> docs & mailing list, but haven't run across the answer.
>>>> >>>>
>>>> >>>> Does anyone have any ideas what (if anything) I'm missing?  Is this
>>>> some
>>>> >>>> quirk of Hive, where it decides that 2 mappers per tasktracker is
>>>> enough,
>>>> >>>> and I should just leave it alone?  Or is there some knob I can
>>>> fiddle to get
>>>> >>>> it to use my cluster at full power?
>>>> >>>>
>>>> >>>> Many thanks in advance,
>>>> >>>> - Brad
>>>> >>>>
>>>> >>>> --
>>>> >>>> Brad Heintz
>>>> >>>> brad.heintz@gmail.com
>>>> >>>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Brad Heintz
>>>> >> brad.heintz@gmail.com
>>>> >
>>>> >
>>>>
>>>> Hive does adjust some map/reduce settings based on the job size. Some
>>>> tasks like a sort might only require one map/reduce to work as well.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com
>>>>
>>>
>>>
>>>
>>> --
>>> Brad Heintz
>>> brad.heintz@gmail.com
>>>
>>
>>
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com
>>
>
>
>
> --
> Yours,
> Zheng
>



-- 
Brad Heintz
brad.heintz@gmail.com

Re: Strange behavior during Hive queries

Posted by Zheng Shao <zs...@gmail.com>.
You mean 14 mappers running concurrently, correct?
How many mappers in total for the hive query?

Zheng




-- 
Yours,
Zheng

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
There are 14 mappers spawned when I do a Hive query - over 7 nodes.  Other
jobs spawn 7 mappers per node (total of 49), rather than 2.

Block size is default.

I'll try the "describe extended" as soon as I get a chance.

Thanks,
- Brad
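
A back-of-envelope check of these figures (editorial sketch; assumes the 14
tasks split the input evenly):

```python
# If only 14 map tasks cover 436 files of ~2GB each, each task averages
# roughly 62GB of input, far more than either one-file-per-mapper
# (436 tasks) or one-block-per-mapper (~13,950 tasks at 64MB blocks)
# would predict.
files, gb_per_file, observed_tasks = 436, 2, 14
gb_per_task = files * gb_per_file / observed_tasks
print(round(gb_per_task, 1))  # 62.3
```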




-- 
Brad Heintz
brad.heintz@gmail.com

RE: Strange behavior during Hive queries

Posted by Ashish Thusoo <at...@facebook.com>.
Can't seem to make head or tail of this. How many mappers does the job spawn? The explain plan seems to be fine. Can you also do a

describe extended

on both the input and the output table.

Also, what is the block size, and how many HDFS nodes is this data spread over?

Ashish
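
On the block-size question: the knob that usually moves mapper counts is
mapred.min.split.size, the subject of the Cloudera link quoted elsewhere in
the thread. A sketch of its effect, assuming the usual FileInputFormat
split-size formula (an assumption, not verified against this Hadoop version):

```python
# Assumed: split_size = max(min_split_size, min(goal_size, block_size)).
# With mapred.min.split.size = 0 (as in Brad's job.xml), splits default to
# the block size; raising it coarsens splits and cuts the mapper count.
def split_size(min_split, block, goal):
    return max(min_split, min(goal, block))

MB = 1 << 20
huge = 1 << 60  # goal_size for a single-wave request over ~1TB of input
print(split_size(0, 64 * MB, huge) // MB)        # 64: one mapper per block
print(split_size(256 * MB, 64 * MB, huge) // MB)  # 256: 4x fewer mappers per file
```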

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
436 files, each about 2GB.


> >> - Brad
> >>
> >>
> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >>>
> >>> Hi Brad,
> >>>
> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
> >>> TaskTracker when it starts up. It cannot be changed per-job.
> >>>
> >>> Hope that helps
> >>> -Todd
> >>>
> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>
> >>> wrote:
> >>>>
> >>>> TIA if anyone can point me in the right direction on this.
> >>>>
> >>>> I'm running a simple Hive query (a count on an external table
> comprising
> >>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
> >>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per
> worker
> >>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I
> see 7
> >>>> mappers spawned on each worker.
> >>>>
> >>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
> >>>> worker.
> >>>>
> >>>> When I do "set -v;" from the Hive command line, I see
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The job.xml for the Hive query shows
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The only lead I have is that the default for
> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
> overridden
> >>>> in the cluster's mapred-site.xml I've tried redundantly overriding
> this
> >>>> variable everyplace I can think of (Hive command line with
> "-hiveconf",
> >>>> using set from the Hive prompt, et al) and nothing works.  I've combed
> the
> >>>> docs & mailing list, but haven't run across the answer.
> >>>>
> >>>> Does anyone have any ideas what (if anything) I'm missing?  Is this
> some
> >>>> quirk of Hive, where it decides that 2 mappers per tasktracker is
> enough,
> >>>> and I should just leave it alone?  Or is there some knob I can fiddle
> to get
> >>>> it to use my cluster at full power?
> >>>>
> >>>> Many thanks in advance,
> >>>> - Brad
> >>>>
> >>>> --
> >>>> Brad Heintz
> >>>> brad.heintz@gmail.com
> >>>
> >>
> >>
> >>
> >> --
> >> Brad Heintz
> >> brad.heintz@gmail.com
> >
> >
>
> Hive does adjust some map/reduce settings based on the job size. Some
> tasks like a sort might only require one map/reduce to work as well.
>
>
>
>
> --
> Brad Heintz
> brad.heintz@gmail.com
>
>
>
>
> --
> Brad Heintz
> brad.heintz@gmail.com
>



-- 
Brad Heintz
brad.heintz@gmail.com

RE: Strange behavior during Hive queries

Posted by Namit Jain <nj...@facebook.com>.
Currently, hive uses 1 mapper per file - does your table have lots of small files? If yes, it might be a good idea to concatenate them into fewer files
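Namit's one-mapper-per-file point ties into how Hadoop's FileInputFormat sizes splits: splits never cross file boundaries, so every file yields at least one map task. A minimal sketch of that arithmetic, assuming the classic `computeSplitSize` formula from Hadoop 0.18-0.20 (the 128 MB block size and the file counts are illustrative):

```python
# Rough sketch of old-style FileInputFormat split sizing. Splits never
# cross file boundaries, so each file contributes at least one map task,
# which is why many small files mean many mappers.

def split_size(block_size, min_split_size, goal_size):
    """Mirrors FileInputFormat.computeSplitSize in Hadoop 0.18-0.20."""
    return max(min_split_size, min(goal_size, block_size))

def num_map_tasks(file_sizes, block_size=128 * 1024 * 1024,
                  min_split_size=0, requested_maps=1):
    total = sum(file_sizes)
    goal = total // max(requested_maps, 1)          # bytes per requested map
    size = split_size(block_size, max(min_split_size, 1), goal)
    # Each file is split independently; at least one split per file.
    return sum(max(1, -(-f // size)) for f in file_sizes)

# Brad's table: 436 files of ~2 GB with 128 MB blocks -> 16 splits per file
two_gb = 2 * 1024 ** 3
print(num_map_tasks([two_gb] * 436))  # 6976 map tasks
```

If Hive at this point really assigned exactly one mapper per file regardless of size, the total would instead collapse to `len(file_sizes)` = 436 — either way far more total map tasks than the 14 running concurrently, which points at slot scheduling rather than split computation.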


From: Ravi Jagannathan [mailto:Ravi.Jagannathan@nominum.com]
Sent: Monday, September 14, 2009 12:17 PM
To: Brad Heintz; hive-user@hadoop.apache.org
Subject: RE: Strange behavior during Hive queries

http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
Related issue: Hive used too many mappers for a very small table.

________________________________
From: Brad Heintz [mailto:brad.heintz@gmail.com]
Sent: Monday, September 14, 2009 11:51 AM
To: hive-user@hadoop.apache.org
Subject: Re: Strange behavior during Hive queries

Ashish -

mapred.min.split.size is set to 0 (according to the job.xml).  The data are stored as uncompressed text files.

Plan is attached.  I've been over it and didn't find anything useful, but I'm also new to Hive and don't claim to understand everything I'm looking at.  If you have any insight, I'd be most grateful.

Many thanks,
- Brad
On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <at...@facebook.com>> wrote:
How is your data stored - sequencefiles, textfiles, compressed? And what is the value of mapred.min.split.size? Hive does not usually make a decision on the number of mappers but it does try to make an estimate of the number of reducers to use. Also if you send out the plan that would be great.

Ashish

________________________________
From: Brad Heintz [mailto:brad.heintz@gmail.com<ma...@gmail.com>]
Sent: Sunday, September 13, 2009 9:36 AM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Strange behavior during Hive queries
Edward -

Yeah, I figured Hive had some decisions it made internally about how many mappers & reducers it used, but this is acting on almost 1TB of data - I don't see why it would use fewer mappers.  Also, this isn't a sort (which would of course use only 1 reducer) - it's a straight count.

Thanks,
- Brad
On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <ed...@gmail.com>> wrote:
On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com>> wrote:
> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>> wrote:
>>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -
>> though clearly, it has to be something Hive is doing.
>>
>> Thanks,
>> - Brad
>>
>>
>> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com>> wrote:
>>>
>>> Hi Brad,
>>>
>>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>> TaskTracker when it starts up. It cannot be changed per-job.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>>
>>> wrote:
>>>>
>>>> TIA if anyone can point me in the right direction on this.
>>>>
>>>> I'm running a simple Hive query (a count on an external table comprising
>>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>>> mappers spawned on each worker.
>>>>
>>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> worker.
>>>>
>>>> When I do "set -v;" from the Hive command line, I see
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The job.xml for the Hive query shows
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The only lead I have is that the default for
>>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>>> docs & mailing list, but haven't run across the answer.
>>>>
>>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>>> it to use my cluster at full power?
>>>>
>>>> Many thanks in advance,
>>>> - Brad
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com<ma...@gmail.com>
>>>
>>
>>
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com<ma...@gmail.com>
>
>
Hive does adjust some map/reduce settings based on the job size. Some
tasks like a sort might only require one map/reduce to work as well.



--
Brad Heintz
brad.heintz@gmail.com<ma...@gmail.com>



--
Brad Heintz
brad.heintz@gmail.com<ma...@gmail.com>

RE: Strange behavior during Hive queries

Posted by Ravi Jagannathan <Ra...@nominum.com>.
http://getsatisfaction.com/cloudera/topics/how_to_decrease_the_number_of_mappers_not_reducers
Related issue: Hive used too many mappers for a very small table.

________________________________
From: Brad Heintz [mailto:brad.heintz@gmail.com]
Sent: Monday, September 14, 2009 11:51 AM
To: hive-user@hadoop.apache.org
Subject: Re: Strange behavior during Hive queries

Ashish -

mapred.min.split.size is set to 0 (according to the job.xml).  The data are stored as uncompressed text files.

Plan is attached.  I've been over it and didn't find anything useful, but I'm also new to Hive and don't claim to understand everything I'm looking at.  If you have any insight, I'd be most grateful.

Many thanks,
- Brad
On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <at...@facebook.com>> wrote:
How is your data stored - sequencefiles, textfiles, compressed? And what is the value of mapred.min.split.size? Hive does not usually make a decision on the number of mappers but it does try to make an estimate of the number of reducers to use. Also if you send out the plan that would be great.

Ashish

________________________________
From: Brad Heintz [mailto:brad.heintz@gmail.com<ma...@gmail.com>]
Sent: Sunday, September 13, 2009 9:36 AM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Strange behavior during Hive queries
Edward -

Yeah, I figured Hive had some decisions it made internally about how many mappers & reducers it used, but this is acting on almost 1TB of data - I don't see why it would use fewer mappers.  Also, this isn't a sort (which would of course use only 1 reducer) - it's a straight count.

Thanks,
- Brad
On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <ed...@gmail.com>> wrote:
On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com>> wrote:
> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>> wrote:
>>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -
>> though clearly, it has to be something Hive is doing.
>>
>> Thanks,
>> - Brad
>>
>>
>> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com>> wrote:
>>>
>>> Hi Brad,
>>>
>>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>> TaskTracker when it starts up. It cannot be changed per-job.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>>
>>> wrote:
>>>>
>>>> TIA if anyone can point me in the right direction on this.
>>>>
>>>> I'm running a simple Hive query (a count on an external table comprising
>>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>>> mappers spawned on each worker.
>>>>
>>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> worker.
>>>>
>>>> When I do "set -v;" from the Hive command line, I see
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The job.xml for the Hive query shows
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The only lead I have is that the default for
>>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>>> docs & mailing list, but haven't run across the answer.
>>>>
>>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>>> it to use my cluster at full power?
>>>>
>>>> Many thanks in advance,
>>>> - Brad
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com<ma...@gmail.com>
>>>
>>
>>
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com<ma...@gmail.com>
>
>
Hive does adjust some map/reduce settings based on the job size. Some
tasks like a sort might only require one map/reduce to work as well.



--
Brad Heintz
brad.heintz@gmail.com<ma...@gmail.com>



--
Brad Heintz
brad.heintz@gmail.com<ma...@gmail.com>

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
Ashish -

mapred.min.split.size is set to 0 (according to the job.xml).  The data are
stored as uncompressed text files.

Plan is attached.  I've been over it and didn't find anything useful, but
I'm also new to Hive and don't claim to understand everything I'm looking
at.  If you have any insight, I'd be most grateful.

Many thanks,
- Brad

On Mon, Sep 14, 2009 at 2:29 PM, Ashish Thusoo <at...@facebook.com> wrote:

>  How is your data stored - sequencefiles, textfiles, compressed? And what
> is the value of mapred.min.split.size? Hive does not usually make a
> decision on the number of mappers but it does try to make an estimate of the
> number of reducers to use. Also if you send out the plan that would be
> great.
>
> Ashish
>
>  ------------------------------
> *From:* Brad Heintz [mailto:brad.heintz@gmail.com]
> *Sent:* Sunday, September 13, 2009 9:36 AM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Strange behavior during Hive queries
>
> Edward -
>
> Yeah, I figured Hive had some decisions it made internally about how many
> mappers & reducers it used, but this is acting on almost 1TB of data - I
> don't see why it would use fewer mappers.  Also, this isn't a sort (which
> would of course use only 1 reducer) - it's a straight count.
>
> Thanks,
> - Brad
>
> On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <ed...@gmail.com>wrote:
>
>>  On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com> wrote:
>> > Hrm... sorry, I didn't read your original query closely enough.
>> >
>> > I'm not sure what could be causing this. The map.tasks.maximum parameter
>> > shouldn't affect it at all - it only affects the number of slots on the
>> > trackers.
>> >
>> > By any chance do you have mapred.max.maps.per.node set? This is a
>> > configuration parameter added by HADOOP-5170 - it's not in trunk or the
>> > vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release
>> this
>> > parameter could cause the behavior you're seeing. However, it would
>> > certainly not default to 2, so I'd be surprised if that were it.
>> >
>> > -Todd
>> >
>> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>
>> wrote:
>> >>
>> >> Todd -
>> >>
>> >> Of course; it makes sense that it would be that way.  But I'm still
>> left
>> >> wondering why, then, my Hive queries are only using 2 mappers per task
>> >> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> >> files from a regular job and a Hive query, and didn't turn up anything
>> -
>> >> though clearly, it has to be something Hive is doing.
>> >>
>> >> Thanks,
>> >> - Brad
>> >>
>> >>
>> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com>
>> wrote:
>> >>>
>> >>> Hi Brad,
>> >>>
>> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>> >>> TaskTracker when it starts up. It cannot be changed per-job.
>> >>>
>> >>> Hope that helps
>> >>> -Todd
>> >>>
>> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> TIA if anyone can point me in the right direction on this.
>> >>>>
>> >>>> I'm running a simple Hive query (a count on an external table
>> comprising
>> >>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>> >>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per
>> worker
>> >>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...",
>> I see 7
>> >>>> mappers spawned on each worker.
>> >>>>
>> >>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>> >>>> worker.
>> >>>>
>> >>>> When I do "set -v;" from the Hive command line, I see
>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>> >>>>
>> >>>> The job.xml for the Hive query shows
>> >>>> mapred.tasktracker.map.tasks.maximum = 7.
>> >>>>
>> >>>> The only lead I have is that the default for
>> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
>> overridden
>> >>>> in the cluster's mapred-site.xml I've tried redundantly overriding
>> this
>> >>>> variable everyplace I can think of (Hive command line with
>> "-hiveconf",
>> >>>> using set from the Hive prompt, et al) and nothing works.  I've
>> combed the
>> >>>> docs & mailing list, but haven't run across the answer.
>> >>>>
>> >>>> Does anyone have any ideas what (if anything) I'm missing?  Is this
>> some
>> >>>> quirk of Hive, where it decides that 2 mappers per tasktracker is
>> enough,
>> >>>> and I should just leave it alone?  Or is there some knob I can fiddle
>> to get
>> >>>> it to use my cluster at full power?
>> >>>>
>> >>>> Many thanks in advance,
>> >>>> - Brad
>> >>>>
>> >>>> --
>> >>>> Brad Heintz
>> >>>> brad.heintz@gmail.com
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Brad Heintz
>> >> brad.heintz@gmail.com
>> >
>> >
>>
>> Hive does adjust some map/reduce settings based on the job size. Some
>> tasks like a sort might only require one map/reduce to work as well.
>>
>
>
>
> --
> Brad Heintz
> brad.heintz@gmail.com
>



-- 
Brad Heintz
brad.heintz@gmail.com

RE: Strange behavior during Hive queries

Posted by Ashish Thusoo <at...@facebook.com>.
How is your data stored - sequencefiles, textfiles, compressed? And what is the value of mapred.min.split.size? Hive does not usually make a decision on the number of mappers but it does try to make an estimate of the number of reducers to use. Also if you send out the plan that would be great.

Ashish
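Ashish's note that Hive estimates the number of reducers (not mappers) can be sketched roughly as below. The parameter names (`hive.exec.reducers.bytes.per.reducer`, `hive.exec.reducers.max`) are real Hive settings, but the defaults shown and the input size are illustrative assumptions:

```python
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=999):                # hive.exec.reducers.max
    """Sketch of Hive's reducer estimate: one reducer per chunk of
    input bytes, capped at a configured maximum, never below one."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))

# ~872 GiB of input (436 files x 2 GiB) under the assumed defaults
print(estimate_reducers(872 * 1024 ** 3))  # 937
```

A count query like Brad's would still collapse to few reducers after the map-side aggregation, so this estimate concerns the reduce stage only; it never changes how many mappers run per node.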

________________________________
From: Brad Heintz [mailto:brad.heintz@gmail.com]
Sent: Sunday, September 13, 2009 9:36 AM
To: hive-user@hadoop.apache.org
Subject: Re: Strange behavior during Hive queries

Edward -

Yeah, I figured Hive had some decisions it made internally about how many mappers & reducers it used, but this is acting on almost 1TB of data - I don't see why it would use fewer mappers.  Also, this isn't a sort (which would of course use only 1 reducer) - it's a straight count.

Thanks,
- Brad

On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <ed...@gmail.com>> wrote:
On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com>> wrote:
> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>> wrote:
>>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -
>> though clearly, it has to be something Hive is doing.
>>
>> Thanks,
>> - Brad
>>
>>
>> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com>> wrote:
>>>
>>> Hi Brad,
>>>
>>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>> TaskTracker when it starts up. It cannot be changed per-job.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>>
>>> wrote:
>>>>
>>>> TIA if anyone can point me in the right direction on this.
>>>>
>>>> I'm running a simple Hive query (a count on an external table comprising
>>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>>> mappers spawned on each worker.
>>>>
>>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> worker.
>>>>
>>>> When I do "set -v;" from the Hive command line, I see
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The job.xml for the Hive query shows
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The only lead I have is that the default for
>>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>>> docs & mailing list, but haven't run across the answer.
>>>>
>>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>>> it to use my cluster at full power?
>>>>
>>>> Many thanks in advance,
>>>> - Brad
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com<ma...@gmail.com>
>>>
>>
>>
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com<ma...@gmail.com>
>
>

Hive does adjust some map/reduce settings based on the job size. Some
tasks like a sort might only require one map/reduce to work as well.



--
Brad Heintz
brad.heintz@gmail.com<ma...@gmail.com>

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
Edward -

Yeah, I figured Hive had some decisions it made internally about how many
mappers & reducers it used, but this is acting on almost 1TB of data - I
don't see why it would use fewer mappers.  Also, this isn't a sort (which
would of course use only 1 reducer) - it's a straight count.

Thanks,
- Brad

On Fri, Sep 11, 2009 at 5:30 PM, Edward Capriolo <ed...@gmail.com>wrote:

> On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com> wrote:
> > Hrm... sorry, I didn't read your original query closely enough.
> >
> > I'm not sure what could be causing this. The map.tasks.maximum parameter
> > shouldn't affect it at all - it only affects the number of slots on the
> > trackers.
> >
> > By any chance do you have mapred.max.maps.per.node set? This is a
> > configuration parameter added by HADOOP-5170 - it's not in trunk or the
> > vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release
> this
> > parameter could cause the behavior you're seeing. However, it would
> > certainly not default to 2, so I'd be surprised if that were it.
> >
> > -Todd
> >
> > On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>
> wrote:
> >>
> >> Todd -
> >>
> >> Of course; it makes sense that it would be that way.  But I'm still left
> >> wondering why, then, my Hive queries are only using 2 mappers per task
> >> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
> >> files from a regular job and a Hive query, and didn't turn up anything -
> >> though clearly, it has to be something Hive is doing.
> >>
> >> Thanks,
> >> - Brad
> >>
> >>
> >> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >>>
> >>> Hi Brad,
> >>>
> >>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
> >>> TaskTracker when it starts up. It cannot be changed per-job.
> >>>
> >>> Hope that helps
> >>> -Todd
> >>>
> >>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>
> >>> wrote:
> >>>>
> >>>> TIA if anyone can point me in the right direction on this.
> >>>>
> >>>> I'm running a simple Hive query (a count on an external table
> comprising
> >>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
> >>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per
> worker
> >>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I
> see 7
> >>>> mappers spawned on each worker.
> >>>>
> >>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
> >>>> worker.
> >>>>
> >>>> When I do "set -v;" from the Hive command line, I see
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The job.xml for the Hive query shows
> >>>> mapred.tasktracker.map.tasks.maximum = 7.
> >>>>
> >>>> The only lead I have is that the default for
> >>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's
> overridden
> >>>> in the cluster's mapred-site.xml I've tried redundantly overriding
> this
> >>>> variable everyplace I can think of (Hive command line with
> "-hiveconf",
> >>>> using set from the Hive prompt, et al) and nothing works.  I've combed
> the
> >>>> docs & mailing list, but haven't run across the answer.
> >>>>
> >>>> Does anyone have any ideas what (if anything) I'm missing?  Is this
> some
> >>>> quirk of Hive, where it decides that 2 mappers per tasktracker is
> enough,
> >>>> and I should just leave it alone?  Or is there some knob I can fiddle
> to get
> >>>> it to use my cluster at full power?
> >>>>
> >>>> Many thanks in advance,
> >>>> - Brad
> >>>>
> >>>> --
> >>>> Brad Heintz
> >>>> brad.heintz@gmail.com
> >>>
> >>
> >>
> >>
> >> --
> >> Brad Heintz
> >> brad.heintz@gmail.com
> >
> >
>
> Hive does adjust some map/reduce settings based on the job size. Some
> tasks like a sort might only require one map/reduce to work as well.
>



-- 
Brad Heintz
brad.heintz@gmail.com

Re: Strange behavior during Hive queries

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com> wrote:
>>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -
>> though clearly, it has to be something Hive is doing.
>>
>> Thanks,
>> - Brad
>>
>>
>> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>> Hi Brad,
>>>
>>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>> TaskTracker when it starts up. It cannot be changed per-job.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>
>>> wrote:
>>>>
>>>> TIA if anyone can point me in the right direction on this.
>>>>
>>>> I'm running a simple Hive query (a count on an external table comprising
>>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>>> mappers spawned on each worker.
>>>>
>>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> worker.
>>>>
>>>> When I do "set -v;" from the Hive command line, I see
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The job.xml for the Hive query shows
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The only lead I have is that the default for
>>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>>> in the cluster's mapred-site.xml I've tried redundantly overriding this
>>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>>> docs & mailing list, but haven't run across the answer.
>>>>
>>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>>> it to use my cluster at full power?
>>>>
>>>> Many thanks in advance,
>>>> - Brad
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com
>>>
>>
>>
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com
>
>

Hive does adjust some map/reduce settings based on the job size. Some
tasks like a sort might only require one map/reduce to work as well.
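Todd's earlier point (tasktracker slot counts are read once at daemon startup) plus Edward's comment suggest a useful mental split: some parameters are daemon-side, and per-job overrides of them are silently ignored. A toy illustration, with a deliberately incomplete and assumption-laden parameter list:

```python
# Illustrative split (not exhaustive): daemon-side parameters are read
# once when the TaskTracker starts, so setting them in a job's
# configuration (or via Hive's "set" / -hiveconf) has no effect.
DAEMON_SIDE = {
    "mapred.tasktracker.map.tasks.maximum",     # map slots per TaskTracker
    "mapred.tasktracker.reduce.tasks.maximum",  # reduce slots per TaskTracker
}
JOB_SIDE = {
    "mapred.min.split.size",  # influences the number of map tasks
    "mapred.reduce.tasks",    # requested number of reducers
}

def effective_per_job(overrides):
    """Return only the overrides a submitted job can actually change."""
    return {k: v for k, v in overrides.items() if k in JOB_SIDE}

print(effective_per_job({
    "mapred.tasktracker.map.tasks.maximum": "7",  # ignored per-job
    "mapred.min.split.size": "268435456",
}))
# {'mapred.min.split.size': '268435456'}
```

This is why Brad's job.xml can show `mapred.tasktracker.map.tasks.maximum = 7` while the trackers still schedule as configured at their own startup.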

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
No, I'm using vanilla 0.20.0.  Other, non-Hive jobs are also running with
more mappers, so I don't think it'd be that setting even if I had it
available.

On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hrm... sorry, I didn't read your original query closely enough.
>
> I'm not sure what could be causing this. The map.tasks.maximum parameter
> shouldn't affect it at all - it only affects the number of slots on the
> trackers.
>
> By any chance do you have mapred.max.maps.per.node set? This is a
> configuration parameter added by HADOOP-5170 - it's not in trunk or the
> vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
> parameter could cause the behavior you're seeing. However, it would
> certainly not default to 2, so I'd be surprised if that were it.
>
> -Todd
>
>
> On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com>wrote:
>
>> Todd -
>>
>> Of course; it makes sense that it would be that way.  But I'm still left
>> wondering why, then, my Hive queries are only using 2 mappers per task
>> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
>> files from a regular job and a Hive query, and didn't turn up anything -
>> though clearly, it has to be something Hive is doing.
>>
>> Thanks,
>> - Brad
>>
>>
>>
>> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Hi Brad,
>>>
>>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>>> TaskTracker when it starts up. It cannot be changed per-job.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>>
>>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>wrote:
>>>
>>>> TIA if anyone can point me in the right direction on this.
>>>>
>>>> I'm running a simple Hive query (a count on an external table comprising
>>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>>> mappers spawned on each worker.
>>>>
>>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>>> worker.
>>>>
>>>> When I do "set -v;" from the Hive command line, I see
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The job.xml for the Hive query shows
>>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>>
>>>> The only lead I have is that the default for
>>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>>> in the cluster's mapred-site.xml, I've tried redundantly overriding this
>>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>>> docs & mailing list, but haven't run across the answer.
>>>>
>>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>>> it to use my cluster at full power?
>>>>
>>>> Many thanks in advance,
>>>> - Brad
>>>>
>>>> --
>>>> Brad Heintz
>>>> brad.heintz@gmail.com
>>>>
>>>
>>>
>>
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com
>>
>
>


-- 
Brad Heintz
brad.heintz@gmail.com

Re: Strange behavior during Hive queries

Posted by Todd Lipcon <to...@cloudera.com>.
Hrm... sorry, I didn't read your original query closely enough.

I'm not sure what could be causing this. The map.tasks.maximum parameter
shouldn't affect it at all - it only affects the number of slots on the
trackers.

By any chance do you have mapred.max.maps.per.node set? This is a
configuration parameter added by HADOOP-5170 - it's not in trunk or the
vanilla 0.18.3 release, but if you're running Cloudera's 0.18.3 release this
parameter could cause the behavior you're seeing. However, it would
certainly not default to 2, so I'd be surprised if that were it.

-Todd

On Fri, Sep 11, 2009 at 2:20 PM, Brad Heintz <br...@gmail.com> wrote:

> Todd -
>
> Of course; it makes sense that it would be that way.  But I'm still left
> wondering why, then, my Hive queries are only using 2 mappers per task
> tracker when other jobs use 7.  I've gone so far as to diff the job.xml
> files from a regular job and a Hive query, and didn't turn up anything -
> though clearly, it has to be something Hive is doing.
>
> Thanks,
> - Brad
>
>
>
> On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hi Brad,
>>
>> mapred.tasktracker.map.tasks.maximum is a parameter read by the
>> TaskTracker when it starts up. It cannot be changed per-job.
>>
>> Hope that helps
>> -Todd
>>
>>
>> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>wrote:
>>
>>> TIA if anyone can point me in the right direction on this.
>>>
>>> I'm running a simple Hive query (a count on an external table comprising
>>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>>> mappers spawned on each worker.
>>>
>>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>>> worker.
>>>
>>> When I do "set -v;" from the Hive command line, I see
>>> mapred.tasktracker.map.tasks.maximum = 7.
>>>
>>> The job.xml for the Hive query shows mapred.tasktracker.map.tasks.maximum
>>> = 7.
>>>
>>> The only lead I have is that the default for
>>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>>> in the cluster's mapred-site.xml, I've tried redundantly overriding this
>>> variable everyplace I can think of (Hive command line with "-hiveconf",
>>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>>> docs & mailing list, but haven't run across the answer.
>>>
>>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>>> it to use my cluster at full power?
>>>
>>> Many thanks in advance,
>>> - Brad
>>>
>>> --
>>> Brad Heintz
>>> brad.heintz@gmail.com
>>>
>>
>>
>
>
> --
> Brad Heintz
> brad.heintz@gmail.com
>

Re: Strange behavior during Hive queries

Posted by Brad Heintz <br...@gmail.com>.
Todd -

Of course; it makes sense that it would be that way.  But I'm still left
wondering why, then, my Hive queries are only using 2 mappers per task
tracker when other jobs use 7.  I've gone so far as to diff the job.xml
files from a regular job and a Hive query, and didn't turn up anything -
though clearly, it has to be something Hive is doing.

Thanks,
- Brad


On Fri, Sep 11, 2009 at 5:16 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Brad,
>
> mapred.tasktracker.map.tasks.maximum is a parameter read by the TaskTracker
> when it starts up. It cannot be changed per-job.
>
> Hope that helps
> -Todd
>
>
> On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com>wrote:
>
>> TIA if anyone can point me in the right direction on this.
>>
>> I'm running a simple Hive query (a count on an external table comprising
>> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
>> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
>> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
>> mappers spawned on each worker.
>>
>> The problem:  When I run my Hive query, I see 2 mappers spawned per
>> worker.
>>
>> When I do "set -v;" from the Hive command line, I see
>> mapred.tasktracker.map.tasks.maximum = 7.
>>
>> The job.xml for the Hive query shows mapred.tasktracker.map.tasks.maximum
>> = 7.
>>
>> The only lead I have is that the default for
>> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
>> in the cluster's mapred-site.xml, I've tried redundantly overriding this
>> variable everyplace I can think of (Hive command line with "-hiveconf",
>> using set from the Hive prompt, et al) and nothing works.  I've combed the
>> docs & mailing list, but haven't run across the answer.
>>
>> Does anyone have any ideas what (if anything) I'm missing?  Is this some
>> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
>> and I should just leave it alone?  Or is there some knob I can fiddle to get
>> it to use my cluster at full power?
>>
>> Many thanks in advance,
>> - Brad
>>
>> --
>> Brad Heintz
>> brad.heintz@gmail.com
>>
>
>


-- 
Brad Heintz
brad.heintz@gmail.com

Re: Strange behavior during Hive queries

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Brad,

mapred.tasktracker.map.tasks.maximum is a parameter read by the TaskTracker
when it starts up. It cannot be changed per-job.

Hope that helps
-Todd

On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz <br...@gmail.com> wrote:

> TIA if anyone can point me in the right direction on this.
>
> I'm running a simple Hive query (a count on an external table comprising
> 436 files, each of ~2GB).  The cluster's mapred-site.xml specifies
> mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker
> node.  When I run regular MR jobs via "bin/hadoop jar myJob.jar...", I see 7
> mappers spawned on each worker.
>
> The problem:  When I run my Hive query, I see 2 mappers spawned per worker.
>
> When I do "set -v;" from the Hive command line, I see
> mapred.tasktracker.map.tasks.maximum = 7.
>
> The job.xml for the Hive query shows mapred.tasktracker.map.tasks.maximum =
> 7.
>
> The only lead I have is that the default for
> mapred.tasktracker.map.tasks.maximum is 2, and even though it's overridden
> in the cluster's mapred-site.xml, I've tried redundantly overriding this
> variable everyplace I can think of (Hive command line with "-hiveconf",
> using set from the Hive prompt, et al) and nothing works.  I've combed the
> docs & mailing list, but haven't run across the answer.
>
> Does anyone have any ideas what (if anything) I'm missing?  Is this some
> quirk of Hive, where it decides that 2 mappers per tasktracker is enough,
> and I should just leave it alone?  Or is there some knob I can fiddle to get
> it to use my cluster at full power?
>
> Many thanks in advance,
> - Brad
>
> --
> Brad Heintz
> brad.heintz@gmail.com
>
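For anyone landing on this thread later: since the TaskTracker reads the slot maximum only at startup, the place to pin it is the cluster-side config, not the job. A minimal mapred-site.xml fragment (assuming a Hadoop 0.20-era cluster; tasktrackers must be restarted for it to take effect, and marking it final is optional belt-and-braces, since clients cannot override this property per-job anyway):

```xml
<!-- mapred-site.xml on each worker node; read by the TaskTracker at
     startup, so restart tasktrackers after editing. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <final>true</final>
</property>
```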