Posted to user@hive.apache.org by javateck javateck <ja...@gmail.com> on 2009/04/18 11:41:21 UTC

hive standalone machine mapreduce can only run one query on the same partition

I have two standalone Hive instances running on the same box, each on a
different port. Each instance has its own table space (I start Hive from
different directories), but both have the same table structure and the same
partitions, and both point to the same Hadoop directory; for example, both
use /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.

When I run two Hive queries on the same partition, only one query proceeds;
the other hangs and can proceed only after the first one is done.

Questions:
1. Is this the expected behavior? If so, how can I work around it?

2. If I have two Hive instances, each running on a different machine but
sharing the same Hadoop directory (a single table store), can I query the
same table for the same partition from the two machines simultaneously?


thanks

Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by javateck javateck <ja...@gmail.com>.
Below is what I got for one MapReduce job:

Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     55.96%      189        79       5        105       0       0 / 0
reduce  0.00%       1          1        0        0         0       0 / 0

Job: job_200904210001_0036
Map tasks: <http://etsx18.apple.com:50030/jobtasks.jsp?jobid=job_200904210001_0036&type=map&pagenum=1>
Reduce tasks: <http://etsx18.apple.com:50030/jobtasks.jsp?jobid=job_200904210001_0036&type=reduce&pagenum=1>
Failed/killed attempts: <http://etsx18.apple.com:50030/jobfailures.jsp?jobid=job_200904210001_0036>

We can see that the job has many map tasks (189 in total), so in this case it
uses up all the map slots. I guess running more Hive instances will not speed
up MapReduce if we only have a small number of nodes.
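
To make the slot arithmetic concrete, here is a minimal Python sketch; it is not from the thread. It assumes 4 map slots in total (2 tasktrackers x 2 slots each, per the settings quoted in this thread), a strict FIFO job queue, equal task durations, and a second query that also needs 189 map tasks. The function and job names are made up for illustration.

```python
import heapq
import math


def fifo_schedule(jobs, total_slots, task_time=1.0):
    """Simulate a strict FIFO scheduler over a fixed pool of map slots.

    jobs: list of (name, num_map_tasks) in submission order.
    Returns {name: (first_task_start, last_task_end)}.
    """
    # Each heap entry is the time at which one slot becomes free.
    free_at = [0.0] * total_slots
    heapq.heapify(free_at)
    result = {}
    for name, num_tasks in jobs:
        first_start = None
        last_end = 0.0
        for _ in range(num_tasks):
            start = heapq.heappop(free_at)  # earliest free slot
            end = start + task_time
            heapq.heappush(free_at, end)
            if first_start is None:
                first_start = start
            last_end = max(last_end, end)
        result[name] = (first_start, last_end)
    return result


# Assumption: 2 tasktrackers x 2 map slots each.
total_slots = 2 * 2
# Map task count taken from the jobtracker table above.
map_tasks = 189
waves = math.ceil(map_tasks / total_slots)

# Assumption: the second query is as large as the first.
timeline = fifo_schedule(
    [("query1", map_tasks), ("query2", map_tasks)], total_slots
)
print("waves per query:", waves)
print("query1 (first start, last end):", timeline["query1"])
print("query2 (first start, last end):", timeline["query2"])
```

Under these assumptions the second query gets its first map slot only when the first query's last wave leaves slots idle, which from the client side looks like a hang.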



On Mon, Apr 20, 2009 at 5:29 PM, Raghu Murthy <rm...@facebook.com> wrote:

> Running multiple servers may be useful if you want to distribute the load of
> compiling queries.
>
> Regarding your setup, can you find out how many mappers were invoked for
> your query by looking at the jobtracker webui or the logs?

Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by Raghu Murthy <rm...@facebook.com>.
Running multiple servers may be useful if you want to distribute the load of
compiling queries.

Regarding your setup, can you find out how many mappers were invoked for
your query by looking at the jobtracker webui or the logs?

On 4/20/09 4:36 PM, "javateck javateck" <ja...@gmail.com> wrote:

> Hi, Raghu,
> 
> the other question I have is that if I run multiple instances of hive server,
> will this improve overall performance? since I'm thinking multiple instances
> could spread the queries.
> 
> thanks,


Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by javateck javateck <ja...@gmail.com>.
Hi, Raghu,

The other question I have is whether running multiple instances of the Hive
server will improve overall performance, since I'm thinking multiple
instances could spread the queries.

thanks,


On Mon, Apr 20, 2009 at 3:57 PM, javateck javateck <ja...@gmail.com> wrote:

> Hi Raghu,
>
>   Machine 1 is the namenode+datanode+jobtracker+tasktracker, machine 2 is
> the datanode+tasktracker. Each machine's tasktracker can run 2 map tasks at
> one time (which is shown below). Actually I'm not sure if the tasktracker
> used up the mapred slots, not sure how hive is utilizing it, below I
> attached one log segment.

Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by javateck javateck <ja...@gmail.com>.
Hi Raghu,

Machine 1 runs the namenode, datanode, jobtracker, and tasktracker; machine 2
runs a datanode and a tasktracker. Each machine's tasktracker can run 2 map
tasks at a time (see the settings below). I'm not sure whether the
tasktrackers' mapred slots were all used up, or how Hive utilizes them; I
have attached one log segment below.

thanks

Below are my Hadoop settings; I am just using the defaults for now:
################################################
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
################################################
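
As a rough illustration of what these settings imply for the whole cluster, here is a small Python sketch; it is not from the thread. It assumes both tasktrackers use the same values and wraps the two properties in the `<configuration>` root element used by Hadoop config files. The helper names `read_properties` and `cluster_slots` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The two settings quoted above, wrapped in a <configuration> root element.
CONF_XML = """
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def cluster_slots(props, num_tasktrackers):
    """Total (map, reduce) slots, assuming identical settings on every node."""
    map_slots = int(props["mapred.tasktracker.map.tasks.maximum"])
    reduce_slots = int(props["mapred.tasktracker.reduce.tasks.maximum"])
    return map_slots * num_tasktrackers, reduce_slots * num_tasktrackers


props = read_properties(CONF_XML)
# Machine 1 and machine 2 each run one tasktracker.
map_total, reduce_total = cluster_slots(props, num_tasktrackers=2)
print(f"map slots: {map_total}, reduce slots: {reduce_total}")
```

With two tasktrackers this gives 4 map slots and 4 reduce slots in total, so a single query with many more map tasks than that can occupy every map slot by itself.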

And below is one segment of the output when I kicked off the query:

################################################
09/04/20 22:49:49 INFO ql.Driver: Semantic Analysis Completed
Total MapReduce jobs = 1
09/04/20 22:49:49 INFO ql.Driver: Total MapReduce jobs = 1
Number of reduce tasks determined at compile time: 1
09/04/20 22:49:49 INFO exec.ExecDriver: Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
09/04/20 22:49:49 INFO exec.ExecDriver: In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
09/04/20 22:49:49 INFO exec.ExecDriver:   set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
09/04/20 22:49:49 INFO exec.ExecDriver: In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
09/04/20 22:49:49 INFO exec.ExecDriver:   set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
09/04/20 22:49:49 INFO exec.ExecDriver: In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
09/04/20 22:49:49 INFO exec.ExecDriver:   set mapred.reduce.tasks=<number>
################################################
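
For anyone who wants to pull the job and reducer counts out of such output automatically, here is a minimal Python sketch; it is not from the thread. It uses three of the log lines quoted above with the mail client's line wrapping undone, and the key names in `PATTERNS` are made up for this example. This excerpt does not report the number of mappers; that still has to come from the jobtracker web UI.

```python
import re

# Log lines copied from the excerpt above, with wrapped lines joined.
LOG_EXCERPT = """\
09/04/20 22:49:49 INFO ql.Driver: Semantic Analysis Completed
09/04/20 22:49:49 INFO ql.Driver: Total MapReduce jobs = 1
09/04/20 22:49:49 INFO exec.ExecDriver: Number of reduce tasks determined at compile time: 1
"""

# Illustrative key names mapped to the message text seen in the excerpt.
PATTERNS = {
    "mapreduce_jobs": re.compile(r"Total MapReduce jobs = (\d+)"),
    "reduce_tasks": re.compile(
        r"Number of reduce tasks determined at compile time: (\d+)"
    ),
}


def summarize(log_text):
    """Return the counts found in the log text, keyed by PATTERNS name."""
    summary = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            summary[key] = int(match.group(1))
    return summary


summary = summarize(LOG_EXCERPT)
print(summary)
```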


On Mon, Apr 20, 2009 at 3:26 PM, Raghu Murthy <rm...@facebook.com> wrote:

> In our setup we have tens of people running queries simultaneously, and we
> are not seeing this problem.
>
> Are machine1 and machine2 part of the same Hadoop map-reduce cluster as well
> (you mentioned that they point to the same Hadoop instance, which indicates
> that they use the same HDFS instance)? If so, how many map/reduce slots have
> you configured each of these machines with? And does your query use up all
> the available slots? If that is the case, your second query will 'hang'
> because it's waiting for slots to become free.
>
> On 4/20/09 3:16 PM, "javateck javateck" <ja...@gmail.com> wrote:
>
> > It seems that Hive does not allow running multiple queries against the
> > same set of data at the same time, even though I have two Hive instances,
> > each server running on a separate machine.
> >
> > My setup is as follows:
> > 1. On machine 1, I installed a standalone Derby server and started it. I
> >    also started a standalone Hive server on machine 1 and created a table;
> >    let's call it testTB.
> > 2. On machine 2, I did the same thing as on machine 1, and also created a
> >    table testTB with the same structure as the one on machine 1.
> > 3. I loaded some data into a specific partition (for example, 2009-04-20)
> >    of testTB through machine 1. On machine 2, I just added the partition
> >    to testTB, since machines 1 and 2 point to the same Hadoop instance,
> >    with a query like "alter table testTB add partition (dt='2009-04-20')".
> >    I checked that queries on both machines worked against the newly
> >    loaded data.
> > 4. I issued a query through JDBC against the new partition on machine 1,
> >    and issued another query to machine 2 on the same partition at the
> >    same time.
> > 5. Watching the screens of machines 1 and 2, I found that only one query
> >    was running; the other was hanging.
> >
> > Can someone confirm whether this behavior is by design? The reason I'm
> > doing this is to be able to run multiple queries in parallel, since a
> > single Hive instance currently does not support multi-threading. I'm
> > stuck because I have to run the queries sequentially, which takes hours.
> >
> > thanks
> >
> > On Sat, Apr 18, 2009 at 6:41 PM, javateck javateck <ja...@gmail.com> wrote:
> >> Thanks, guys, that works for a single table space when starting a Derby
> >> db server. Now my issue is that if I have two standalone Hive instances
> >> running on the same machine, it seems that only one instance can run
> >> MapReduce jobs at a time; the other just hangs. I don't know if it's
> >> limited by the default Hadoop-side setting that allows only two mappers
> >> to run at the same time on one tasktracker.
> >>
> >>
> >> On Sat, Apr 18, 2009 at 9:03 AM, Ashish Thusoo <at...@facebook.com> wrote:
> >>> This should work. We have users running queries from their own physical
> >>> machines and sometimes from different hive directories as well. My guess
> >>> is that the configuration in one of the conf files is not correct or
> >>> HADOOP_HOME is not set properly. Are you using a local metastore, or do
> >>> you run it as a server or have it backed by a mysql/Derby server?
> >>>
> >>> Ashish
> >>>
> >>>
> >>> ________________________________________
> >>> From: javateck javateck [javateck@gmail.com]
> >>> Sent: Saturday, April 18, 2009 2:41 AM
> >>> To: hive-user@hadoop.apache.org
> >>> Subject: hive standalone machine mapreduce can only run one query on
> the
> >>> same
> >>>   partition
> >>>
> >>> I have two standalone machines running on the same box, each of course
> >>> running on a different port. And these two instances each has its own
> table
> >>> space (by starting hive from different directories), but they have the
> same
> >>> table structure, both of them have the same partitions also, and they
> are
> >>> pointing to the same hadoop directory, for example, pointing to the
> same
> >>> /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.
> >>>
> >>> When I run two hive queries on the same partition, only one query can
> >>> proceed, the other is hanging there and can proceed only after the
> first one
> >>> is done.
> >>>
> >>> Question on this:
> >>> 1. is this the expected behavior? and if so, how go around this?
> >>>
> >>> 2. if I have two machines, each running on different machine, but
> sharing
> >>> the
> >>> same hadoop directory (sharing single table store), can I query the
> same
> >>> table for the same partition from the two machines simultaneously?
> >>>
> >>>
> >>> thanks
> >>
> >
>
>

Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by Raghu Murthy <rm...@facebook.com>.
In our setup we have 10s of people running queries simultaneously and we are
not seeing this problem.

Are machine1 and machine2 part of the same hadoop map-reduce cluster as well
(you mentioned that they point to the same hadoop instance, which
indicates that they use the same HDFS instance)? If so, how many map/reduce
slots have you configured each of these machines with? And, does your query
use up all the available slots? If that is the case, your second query will
'hang' because it's waiting for slots to become free.
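
To illustrate the slot question above, here is a minimal Python sketch. It
assumes a strict first-in-first-out job scheduler, and the slot and task
counts are made up; neither is confirmed anywhere in this thread. It only
shows how a query whose tasks fill every slot leaves a second query with
nothing to run until slots free up.

```python
def simulate_fifo(total_slots, jobs, ticks):
    """Simulate strict FIFO slot allocation (an assumption, not a measured setup).

    jobs: list of (name, num_tasks) in submission order; each task holds
    one slot for exactly one tick.
    Returns one dict per tick mapping job name -> tasks running in that tick.
    """
    remaining = [[name, tasks] for name, tasks in jobs]
    timeline = []
    for _ in range(ticks):
        free = total_slots
        running = {}
        # Oldest job is served first; later jobs only get leftover slots.
        for job in remaining:
            if free == 0:
                break
            take = min(free, job[1])
            if take > 0:
                running[job[0]] = take
                job[1] -= take
                free -= take
        timeline.append(running)
    return timeline


# Hypothetical numbers: 4 map slots in the whole cluster,
# query1 needs 10 map tasks, query2 needs 3.
for tick, running in enumerate(simulate_fifo(4, [("query1", 10), ("query2", 3)], 4)):
    print(tick, running)
```

In this sketch query2 gets no slots at all during the first two ticks, which
from the client side would look like a hang.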

On 4/20/09 3:16 PM, "javateck javateck" <ja...@gmail.com> wrote:

> It seems that hive is not allowing running multiple queries to the same set of
> data at a time even I have two hive instances, each server running on a
> separate machine.
> 
> My set up is like following:
> 1. On machine 1, I installed a standalone derby server and started it, and
> also started a hive standalone server on machine 1, and I created a table,
> let's call it testTB
> 2. On machine 2, I did the same thing as machine 1. And I also created a table
> testTB, which is having a same structure as that on machine 1
> 3. I load some data with a specific partition (for example, 2009-04-20)
> through machine 1 to testTB, and on machine 2, I just add a partition on the
> table testTB since machine 1&2 are pointing to the same hadoop instance, query
> is like "alter table testTB add partition (dt='2009-04-20')". And I checked
> that query on machine 1 & 2 were working when querying the newly loaded data
> 4. I issued a query through JDBC to query the newly specific partition on
> machine 1, and issued anther query to machine 2 on the same partition at the
> same time.
> 5. I find that only one query was running when I saw the screen on machine 1 &
> 2, the other was hanging.
> 
> Can someone confirm is this the behavior by design? The reason I'm doing this
> is to be able to run multiple queries at the same time in parallel since one
> hive instance is not supporting multi-threading currently. Currently I'm stuck
> because I have to run queries sequentially which will be taking hours.
> 
> thanks
> 
> On Sat, Apr 18, 2009 at 6:41 PM, javateck javateck <ja...@gmail.com> wrote:
>> thanks, guys, that's working for single table space if starting a derby db
>> server. Now my issue is that if I have two hive standalone instances running
>> on the same machine, it seems that only one instance can run the mapreduce
>> jobs at one time, the other is just hanging there,  I don't know if it's
>> limited by the default settings that only allow two mappers working at the
>> same time by one tasktracker that's limited by hadoop side.
>> 
>> 
>> On Sat, Apr 18, 2009 at 9:03 AM, Ashish Thusoo <at...@facebook.com> wrote:
>>> This should work. We have users running queries from their own physical
>>> machines and sometimes from different hive directories as well. My guess is
>>> that the configuration in one of the conf files is not correct or
>>> HADOOP_HOME 
>>> is not set properly. Are you using a local metastore or do you run it as a
>>> server or have it backed by a mysql/Derby server?
>>> 
>>> Ashish
>>> 
>>> 
>>> ________________________________________
>>> From: javateck javateck [javateck@gmail.com]
>>> Sent: Saturday, April 18, 2009 2:41 AM
>>> To: hive-user@hadoop.apache.org
>>> Subject: hive standalone machine mapreduce can only run one query on the
>>> same 
>>>   partition
>>> 
>>> I have two standalone machines running on the same box, each of course
>>> running on a different port. And these two instances each has its own table
>>> space (by starting hive from different directories), but they have the same
>>> table structure, both of them have the same partitions also, and they are
>>> pointing to the same hadoop directory, for example, pointing to the same
>>> /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.
>>> 
>>> When I run two hive queries on the same partition, only one query can
>>> proceed, the other is hanging there and can proceed only after the first one
>>> is done.
>>> 
>>> Question on this:
>>> 1. is this the expected behavior? and if so, how go around this?
>>> 
>>> 2. if I have two machines, each running on different machine, but sharing
>>> the 
>>> same hadoop directory (sharing single table store), can I query the same
>>> table for the same partition from the two machines simultaneously?
>>> 
>>> 
>>> thanks
>> 
> 


Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by javateck javateck <ja...@gmail.com>.
It seems that hive does not allow running multiple queries against the same
set of data at the same time, even though I have two hive instances, each
server running on a separate machine.
My setup is as follows:

   1. On machine 1, I installed a standalone derby server and started it,
   and also started a hive standalone server on machine 1, and I created a
   table, let's call it testTB
   2. On machine 2, I did the same thing as on machine 1. I also created a
   table testTB, which has the same structure as the one on machine 1
   3. I loaded some data into a specific partition (for example, 2009-04-20)
   of testTB through machine 1, and on machine 2 I just added a partition to
   the table testTB, since machines 1 & 2 are pointing to the same hadoop
   instance; the query is like
   "alter table testTB add partition (dt='2009-04-20')". And I checked that
   queries on machines 1 & 2 were working when querying the newly loaded data
   4. I issued a query through JDBC against the newly loaded partition on
   machine 1, and issued another query to machine 2 on the same partition at
   the same time.
   5. I found that only one query was running when I watched the screens on
   machines 1 & 2; the other was hanging.


Can someone confirm whether this behavior is by design? The reason I'm doing
this is to be able to run multiple queries in parallel, since one hive
instance does not currently support multi-threading. Currently I'm stuck
because I have to run queries sequentially, which takes hours.

thanks

On Sat, Apr 18, 2009 at 6:41 PM, javateck javateck <ja...@gmail.com> wrote:

> thanks, guys, that's working for single table space if starting a derby db
> server. Now my issue is that if I have two hive standalone instances running
> on the same machine, it seems that only one instance can run the mapreduce
> jobs at one time, the other is just hanging there,  I don't know if it's
> limited by the default settings that only allow two mappers working at the
> same time by one tasktracker that's limited by hadoop side.
>
>
> On Sat, Apr 18, 2009 at 9:03 AM, Ashish Thusoo <at...@facebook.com> wrote:
>
>> This should work. We have users running queries from their own physical
>> machines and sometimes from different hive directories as well. My guess is
>> that the configuration in one of the conf files is not correct or
>> HADOOP_HOME is not set properly. Are you using a local metastore or do you
>> run it as a server or have it backed by a mysql/Derby server?
>>
>> Ashish
>>
>>
>> ________________________________________
>> From: javateck javateck [javateck@gmail.com]
>> Sent: Saturday, April 18, 2009 2:41 AM
>> To: hive-user@hadoop.apache.org
>> Subject: hive standalone machine mapreduce can only run one query on the
>> same   partition
>>
>> I have two standalone machines running on the same box, each of course
>> running on a different port. And these two instances each has its own table
>> space (by starting hive from different directories), but they have the same
>> table structure, both of them have the same partitions also, and they are
>> pointing to the same hadoop directory, for example, pointing to the same
>> /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.
>>
>> When I run two hive queries on the same partition, only one query can
>> proceed, the other is hanging there and can proceed only after the first one
>> is done.
>>
>> Question on this:
>> 1. is this the expected behavior? and if so, how go around this?
>>
>> 2. if I have two machines, each running on different machine, but sharing
>> the same hadoop directory (sharing single table store), can I query the same
>> table for the same partition from the two machines simultaneously?
>>
>>
>> thanks
>>
>
>

Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by javateck javateck <ja...@gmail.com>.
thanks, guys, that's working for a single table space when starting a derby db
server. Now my issue is that if I have two hive standalone instances running
on the same machine, it seems that only one instance can run mapreduce
jobs at a time; the other is just hanging there. I don't know if it's
limited by the default settings on the hadoop side, which only allow two
mappers to work at the same time on one tasktracker.
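
A rough way to check that guess, as a Python sketch. The two-slot figure is
the default mentioned above (in Hadoop of this era it is controlled by
mapred.tasktracker.map.tasks.maximum); the tasktracker count and the number
of map tasks below are hypothetical.

```python
import math


def map_slot_capacity(num_tasktrackers, map_slots_per_tracker=2):
    """Total map tasks the cluster can run at once."""
    return num_tasktrackers * map_slots_per_tracker


def waves_needed(num_map_tasks, capacity):
    """How many rounds of tasks a job needs when it has the whole cluster."""
    return math.ceil(num_map_tasks / capacity)


# Hypothetical single-machine setup: one tasktracker, two map slots.
capacity = map_slot_capacity(num_tasktrackers=1)
print(capacity)                     # total concurrent map tasks
print(waves_needed(100, capacity))  # rounds for a 100-task job
```

With only two slots, a single query with many map tasks can keep both slots
busy for its whole run, so a second query would have to wait for it.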

On Sat, Apr 18, 2009 at 9:03 AM, Ashish Thusoo <at...@facebook.com> wrote:

> This should work. We have users running queries from their own physical
> machines and sometimes from different hive directories as well. My guess is
> that the configuration in one of the conf files is not correct or
> HADOOP_HOME is not set properly. Are you using a local metastore or do you
> run it as a server or have it backed by a mysql/Derby server?
>
> Ashish
>
>
> ________________________________________
> From: javateck javateck [javateck@gmail.com]
> Sent: Saturday, April 18, 2009 2:41 AM
> To: hive-user@hadoop.apache.org
> Subject: hive standalone machine mapreduce can only run one query on the
> same   partition
>
> I have two standalone machines running on the same box, each of course
> running on a different port. And these two instances each has its own table
> space (by starting hive from different directories), but they have the same
> table structure, both of them have the same partitions also, and they are
> pointing to the same hadoop directory, for example, pointing to the same
> /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.
>
> When I run two hive queries on the same partition, only one query can
> proceed, the other is hanging there and can proceed only after the first one
> is done.
>
> Question on this:
> 1. is this the expected behavior? and if so, how go around this?
>
> 2. if I have two machines, each running on different machine, but sharing
> the same hadoop directory (sharing single table store), can I query the same
> table for the same partition from the two machines simultaneously?
>
>
> thanks
>

RE: hive standalone machine mapreduce can only run one query on the same partition

Posted by Ashish Thusoo <at...@facebook.com>.
This should work. We have users running queries from their own physical machines and sometimes from different hive directories as well. My guess is that the configuration in one of the conf files is not correct or HADOOP_HOME is not set properly. Are you using a local metastore or do you run it as a server or have it backed by a mysql/Derby server?

Ashish


________________________________________
From: javateck javateck [javateck@gmail.com]
Sent: Saturday, April 18, 2009 2:41 AM
To: hive-user@hadoop.apache.org
Subject: hive standalone machine mapreduce can only run one query on the same   partition

I have two standalone machines running on the same box, each of course running on a different port. And these two instances each has its own table space (by starting hive from different directories), but they have the same table structure, both of them have the same partitions also, and they are pointing to the same hadoop directory, for example, pointing to the same /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.

When I run two hive queries on the same partition, only one query can proceed, the other is hanging there and can proceed only after the first one is done.

Question on this:
1. is this the expected behavior? and if so, how go around this?

2. if I have two machines, each running on different machine, but sharing the same hadoop directory (sharing single table store), can I query the same table for the same partition from the two machines simultaneously?


thanks

Re: hive standalone machine mapreduce can only run one query on the same partition

Posted by Edward Capriolo <ed...@gmail.com>.
You can use the same table space and metastore if you use an external
database.

http://wiki.apache.org/hadoop/HiveDerbyServerMode

We probably should highlight this more, since the default setup is
single-user without an external metastore.
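
As a sketch of what pointing every hive instance at one external Derby
server could look like, the Python below just prints the two hive-site.xml
properties involved. The host name "machine1", the database name, and the
helper function are assumptions for illustration; 1527 is Derby's default
network port. Check the wiki page above for the full setup.

```python
import xml.etree.ElementTree as ET


def metastore_config(host, port=1527, db="metastore_db"):
    """Build a hive-site.xml fragment for a shared Derby network server.

    host, port and db are placeholders; use the values of your own server.
    """
    props = {
        "javax.jdo.option.ConnectionURL":
            f"jdbc:derby://{host}:{port}/{db};create=true",
        "javax.jdo.option.ConnectionDriverName":
            "org.apache.derby.jdbc.ClientDriver",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# "machine1" is a hypothetical host running the Derby network server.
print(metastore_config("machine1"))
```

Every hive instance would carry the same fragment, so they all see one set
of tables and partitions instead of each keeping a private copy.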

On Sat, Apr 18, 2009 at 5:41 AM, javateck javateck <ja...@gmail.com> wrote:
> I have two standalone machines running on the same box, each of course
> running on a different port. And these two instances each has its own table
> space (by starting hive from different directories), but they have the same
> table structure, both of them have the same partitions also, and they are
> pointing to the same hadoop directory, for example, pointing to the same
> /user/hive/warehouse/mytable/dt=2009-04-15 for one partition.
> When I run two hive queries on the same partition, only one query can
> proceed, the other is hanging there and can proceed only after the first one
> is done.
> Question on this:
> 1. is this the expected behavior? and if so, how go around this?
> 2. if I have two machines, each running on different machine, but sharing
> the same hadoop directory (sharing single table store), can I query the same
> table for the same partition from the two machines simultaneously?
>
> thanks