Posted to user@hive.apache.org by Abhishek kumar <ab...@gmail.com> on 2014/12/29 05:38:35 UTC

Hive being slow

Hi,

I am using Hive 0.14 on top of HBase (~10 GB of data), and queries
against HBase are slow. My query looks like the following:

select * from table1 where id > 'zzzz';  (id is the row-key)

As per the Hive code, the predicate on id gets pushed to the HBase scanner
as the 'startKey'. Since no row keys (id) satisfy this criterion, the query
should be extremely fast. But Hive takes a long time; it looks like a full
HBase table scan.
Can someone let me know where my understanding is wrong?

--
Abhishek
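
The expectation in the post above can be illustrated with a small model.
This is only a sketch under stated assumptions, not Hive or HBase code:
the table is a sorted Python list of hypothetical row keys, and the two
function names are made up for illustration. It compares how many rows
are read when the scan seeks to a start key versus when every row is read
and filtered afterwards.

```python
from bisect import bisect_right


def scan_from_start_key(sorted_keys, start_key):
    """Range scan: seek past start_key, then read only the rows after it."""
    first = bisect_right(sorted_keys, start_key)  # first key > start_key
    matched = sorted_keys[first:]
    return matched, len(matched)  # rows read equals rows returned


def full_scan_with_filter(sorted_keys, start_key):
    """Full scan: read every row and apply the predicate afterwards."""
    rows_read = 0
    matched = []
    for key in sorted_keys:
        rows_read += 1
        if key > start_key:
            matched.append(key)
    return matched, rows_read


# Hypothetical table: 100,000 row keys, none of them greater than 'zzz'.
keys = sorted(f"row{i:06d}" for i in range(100_000))

pushed_rows, pushed_reads = scan_from_start_key(keys, "zzz")
full_rows, full_reads = full_scan_with_filter(keys, "zzz")
print(pushed_reads, full_reads)
```

Both variants return no rows, but only the second one touches the whole
table, which is the difference the datanode read volume would reflect.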

Re: Hive being slow

Posted by Abhishek kumar <ab...@gmail.com>.

0.14.0

--
Abhishek

On Thu, Jan 15, 2015 at 10:43 PM, Ashutosh Chauhan <ha...@apache.org>
wrote:

> which hive version you are using ?

Re: Hive being slow

Posted by Ashutosh Chauhan <ha...@apache.org>.

Which Hive version are you using?

On Thu, Jan 15, 2015 at 12:44 AM, Abhishek kumar <ab...@gmail.com>
wrote:

> Hi,
>
> Thanks for the reply.
>
> I tried that, but no luck. The map-reduce seems to be stuck (taking a lot
> of time, just for 65 lakhs of Hbase rows). I am attaching the log file (or
> http://pastebin.com/BUYDUiEu)
>
> My only question is why the filter push-down for row-key (*startKey* and
> *stopKey* for the *Scanner*) is not happening to Hbase. If the push-down
> happens, then Hbase will resolve this Scanner very fast and no matter MR
> job runs or not, the query resolution will be very fast.
>
> --
> Abhishek

Re: Hive being slow

Posted by Abhishek kumar <ab...@gmail.com>.

Hi,

Thanks for the reply.

I tried that, but no luck. The MapReduce job seems to be stuck: it takes a
long time for just 65 lakh (6.5 million) HBase rows. I am attaching the log
file (also at http://pastebin.com/BUYDUiEu).

My only question is why the filter push-down for the row key (*startKey*
and *stopKey* for the *Scanner*) is not reaching HBase. If the push-down
happens, HBase will resolve this Scanner very quickly, and the query will
return quickly whether or not an MR job runs.

--
Abhishek

On Thu, Jan 15, 2015 at 1:59 AM, Ashutosh Chauhan <ha...@apache.org>
wrote:

> Can you run your query with following config:
>
> hive> set hive.fetch.task.conversion=none;
>
> and run your two queries with this. Lets see if this makes a difference.
> My expectation is this will result in MR job getting launched and thus
> runtimes might be different.

Re: Hive being slow

Posted by Ashutosh Chauhan <ha...@apache.org>.

Can you run your query with the following config:

hive> set hive.fetch.task.conversion=none;

Then run your two queries with it set. Let's see if this makes a
difference. My expectation is that this will cause an MR job to be
launched, so the runtimes might be different.

On Sat, Jan 10, 2015 at 4:54 PM, Abhishek kumar <ab...@gmail.com>
wrote:

> First I tried running the query: select * from table1 where id = 'value';
> It was very fast, as expected since Hbase replied the results very fast.
> In this case, I observed no map/reduce task getting spawned.
>
> Now, for the query, select * from table1 where id > 'zzz', I expected the
> filter push down to happen (at least the 0.14 code says). And since, there
> were no results found, so Hbase will again reply very fast and thus hive
> should output the query's result very fast. But, this is not happening, and
> from the logs of datanode, it looks like a lot of reads are happening
> (close to full table scan of 10GBs of data). I expected the response time
> to be very close to the above query's time.
>
> I will check about the number of task getting launched.
>
> My questions are:
> * Why there was no any filter pushdown (id > 'zzz') happening for this
> very simple query.
> * Since this query can only be resolved from HBase, will Hive launch map
> tasks (last time, I guess I observed no map task getting launched)
>
> --
> Abhishek

Re: Hive being slow

Posted by Abhishek kumar <ab...@gmail.com>.

First I tried running the query: select * from table1 where id = 'value';
It was very fast, as expected, since HBase returned the results quickly. In
this case, I observed no map/reduce tasks being spawned.

Now, for the query select * from table1 where id > 'zzz', I expected the
filter push-down to happen (at least that is what the 0.14 code suggests).
Since there are no matching rows, HBase should again reply very quickly,
and Hive should return the result just as fast. But this is not happening:
from the datanode logs, it looks like a lot of reads are taking place
(close to a full table scan of 10 GB of data). I expected the response time
to be very close to that of the query above.

I will check the number of tasks being launched.

My questions are:
* Why was there no filter pushdown (id > 'zzz') for this very simple
query?
* Since this query can be resolved entirely by HBase, will Hive launch map
tasks? (Last time, I believe I observed no map tasks being launched.)

--
Abhishek
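
To make the difference between the two queries concrete, here is a minimal
sketch of how a comparison on the row key could be turned into scan bounds.
It is an illustrative model, not Hive's actual implementation: the
functions scan_bounds and in_range are hypothetical, and the only HBase
behaviour assumed is that a scan's start row is inclusive and its stop row
is exclusive.

```python
def scan_bounds(op, value):
    """Map a comparison on the row key to (start_row, stop_row) bounds.

    start_row is inclusive, stop_row is exclusive, None means unbounded.
    """
    key = value.encode("utf-8")
    next_key = key + b"\x00"  # smallest byte string that sorts after key
    if op == "=":
        return key, next_key
    if op == ">":
        return next_key, None
    if op == ">=":
        return key, None
    if op == "<":
        return None, key
    if op == "<=":
        return None, next_key
    raise ValueError(f"unsupported operator: {op}")


def in_range(row, bounds):
    """Check whether a row key falls inside the given scan bounds."""
    start, stop = bounds
    after_start = start is None or row >= start
    before_stop = stop is None or row < stop
    return after_start and before_stop


equality = scan_bounds("=", "value")
greater = scan_bounds(">", "zzz")
print(equality, greater)
```

Under this model the equality query covers exactly one possible row key,
while id > 'zzz' becomes an open-ended scan that starts just past 'zzz'.
If no keys exist in that range, a scan with these bounds has nothing to
read.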

On Sat, Jan 10, 2015 at 4:14 AM, Ashutosh Chauhan <ha...@apache.org>
wrote:

> Hi Abhishek,
>
> How are you determining its resulting in full table scan? One way to
> ascertain that filter got pushed down is to see how many tasks were
> launched for your query, with and without filter. One would expect lower #
> of splits (and thus tasks) for query having filter.
>
> Thanks,
> Ashutosh

Re: Hive being slow

Posted by Ashutosh Chauhan <ha...@apache.org>.

Hi Abhishek,

How are you determining that it's resulting in a full table scan? One way
to ascertain that the filter got pushed down is to see how many tasks were
launched for your query, with and without the filter. One would expect a
lower number of splits (and thus tasks) for the query with the filter.

Thanks,
Ashutosh
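
The split-count check suggested above can be sketched as follows. This is
a simplified model, assuming one input split per HBase region and that
regions entirely outside the scan range are skipped; the region boundaries
and the function splits_for_scan are hypothetical, not taken from the
thread or from Hive's code.

```python
def splits_for_scan(regions, start_row=None, stop_row=None):
    """Return the regions whose key range overlaps [start_row, stop_row).

    Each region is (region_start, region_end). An empty byte string marks
    the open start of the first region and the open end of the last one.
    """
    selected = []
    for region_start, region_end in regions:
        # Region lies entirely before the scan range.
        if (start_row is not None and region_end != b""
                and region_end <= start_row):
            continue
        # Region lies entirely after the scan range.
        if stop_row is not None and region_start >= stop_row:
            continue
        selected.append((region_start, region_end))
    return selected


# Hypothetical table split into four regions by row key.
regions = [(b"", b"g"), (b"g", b"n"), (b"n", b"t"), (b"t", b"")]

no_filter = splits_for_scan(regions)
with_filter = splits_for_scan(regions, start_row=b"zzz\x00")
print(len(no_filter), len(with_filter))
```

In this model the unfiltered query produces one split per region, while
the query with a pushed-down start key only keeps the last region. Seeing
the same number of tasks with and without the filter would therefore hint
that the start key never reached the scan.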

On Sun, Dec 28, 2014 at 8:38 PM, Abhishek kumar <ab...@gmail.com>
wrote:

> Hi,
>
> I am using hive 0.14 which runs over hbase (having ~10 GB of data). I am
> facing issues in terms of slowness when querying over Hbase. My query looks
> like following:
>
> select * from table1 where id > 'zzzz';  (id is the row-key)
>
> As per the hive-code, id > 'zzz', is getting pushed to Hbase scanner as
> 'startKey'. Now given there are no such rows-keys (id) which satisfies this
> criteria, this query should be extremely fast. But hive is taking a lot of
> time, looks like full hbase table scan.
> Can someone let me know where am I wrong in understanding the whole thing?
>
> --
> Abhishek
>