Posted to user@hive.apache.org by ravi teja <ra...@gmail.com> on 2016/08/31 12:42:34 UTC

Quota for rogue ad-hoc queries

Hi Community,

Many users run ad-hoc Hive queries on our platform.
Some rogue queries managed to fill up the HDFS space, causing mainstream
queries to fail.

We wanted to limit the data generated by these ad-hoc queries.
We are aware of the strict-mode parameter (hive.mapred.mode=strict), which
limits the data being scanned, but it is of little help as a huge number of
user tables aren't partitioned.

Is there a way we can limit the data generated by Hive per query, like a
Hive parameter for setting HDFS quotas on the job-level *scratch* directory,
or any other approach?
What's the general approach to guardrail such multi-tenant cases?

Thanks in advance,
Ravi
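
For reference, this is what such a per-directory limit looks like at the
HDFS level. A space quota can be set with "hdfs dfsadmin -setSpaceQuota" or
programmatically; a minimal Java sketch follows, where the scratch path and
the 10 GB figure are illustrative assumptions (and note that quotas can only
be set by an HDFS superuser):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;

    public class ScratchQuota {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // 10 GB space quota on a hypothetical per-query scratch dir;
            // equivalent to: hdfs dfsadmin -setSpaceQuota 10g <path>
            dfs.setQuota(new Path("/tmp/hive/scratch/query_0001"),
                         HdfsConstants.QUOTA_DONT_SET,  // leave name quota unset
                         10L * 1024 * 1024 * 1024);
        }
    }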

Re: Quota for rogue ad-hoc queries

Posted by Edward Capriolo <ed...@gmail.com>.
I have written Nagios scripts that watch the JobTracker UI and report when
things take too long.
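
A minimal sketch of the same idea against a YARN ResourceManager (rather
than the MR1 JobTracker), using the YarnClient Java API; the warning and
critical thresholds are made-up values:

    import java.util.EnumSet;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Nagios-style check: exit 1 (WARNING) / 2 (CRITICAL) when running
    // applications exceed an elapsed-time threshold.
    public class LongRunningAppCheck {
        public static void main(String[] args) throws Exception {
            long warnMs = 30 * 60 * 1000L;      // assumed: warn after 30 min
            long critMs = 2 * 60 * 60 * 1000L;  // assumed: critical after 2 h
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            int warn = 0, crit = 0;
            for (ApplicationReport app : yarn.getApplications(
                    EnumSet.of(YarnApplicationState.RUNNING))) {
                long elapsed = System.currentTimeMillis() - app.getStartTime();
                if (elapsed > critMs) crit++;
                else if (elapsed > warnMs) warn++;
            }
            yarn.stop();
            if (crit > 0) {
                System.out.println("CRITICAL: " + crit + " long-running apps");
                System.exit(2);
            }
            if (warn > 0) {
                System.out.println("WARNING: " + warn + " long-running apps");
                System.exit(1);
            }
            System.out.println("OK: no long-running apps");
        }
    }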

Re: Quota for rogue ad-hoc queries

Posted by Loïc Chanel <lo...@telecomnancy.net>.
On the topic of timeouts, if I may say, they are a dangerous way to deal
with requests, as a "good" request may last longer than an "evil" one.
Be sure timeouts won't kill any important job before putting them into
place. You can set these things in the component parameters (Tez,
MapReduce ...), but not directly in YARN. At least that was the case when I
tried this (a year ago).
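
For MapReduce specifically, one such component-level knob is the per-task
inactivity timeout; a minimal sketch with an assumed 10-minute value (note
this kills tasks that stop reporting progress, it is not a wall-clock cap
on the whole job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TimeoutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Kill any task attempt that reports no progress for 10 minutes.
            conf.setLong("mapreduce.task.timeout", 10 * 60 * 1000L);
            Job job = Job.getInstance(conf, "guarded-adhoc-job");
            // ... configure input/output/mapper/reducer as usual ...
        }
    }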

Regards,

Loïc CHANEL
System & virtualization engineer
TO - XaaS Ind - Worldline (Villeurbanne, France)

Re: Quota for rogue ad-hoc queries

Posted by Stephen Sprague <sp...@gmail.com>.
> rogue queries

so this really isn't limited to just hive, is it?  any dbms perhaps has to
contend with this - even malicious rogue queries, as a matter of fact.

timeouts are a cheap way systems handle this - assuming time is related to
resource. i'm sure beeline or whatever client you use has a timeout feature.

maybe one could write a separate service - say a governor - that watches
over YARN (or hdfs or whatever resource is scarce) - and terminates the
process if it goes beyond a threshold.  think OOM killer.
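
a minimal sketch of such a governor using the YarnClient API - the 4-hour
budget and the "adhoc" queue name here are assumptions, not anything hive
or YARN ship:

    import java.util.EnumSet;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AdhocGovernor {
        public static void main(String[] args) throws Exception {
            long maxMs = 4 * 60 * 60 * 1000L;  // assumed 4-hour budget
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            for (ApplicationReport app : yarn.getApplications(
                    EnumSet.of(YarnApplicationState.RUNNING))) {
                long elapsed = System.currentTimeMillis() - app.getStartTime();
                // only police the ad-hoc queue (queue name is an assumption)
                if (elapsed > maxMs && "adhoc".equals(app.getQueue())) {
                    yarn.killApplication(app.getApplicationId());
                }
            }
            yarn.stop();
        }
    }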

but, yeah, i admittedly don't know of something out there already you can
just tap into but YARN's Resource Manager seems to be place i'd research
for starters. Just look look at its name. :)

my unsolicited 2 cents.

Re: Quota for rogue ad-hoc queries

Posted by ravi teja <ra...@gmail.com>.
Hi,

I am trying to add this feature to Hive (HIVE-11735
<https://issues.apache.org/jira/browse/HIVE-11735>).
But I hit a roadblock while setting the quota during session folder
creation, as the quota can only be set by a superuser in HDFS.
Any thoughts on how to avoid this issue?
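
One common pattern for this (a sketch of an approach, not something Hive
ships): delegate the setQuota call to a small privileged helper that logs
in with the HDFS superuser's keytab, so HiveServer2 itself never needs the
privilege. The principal, keytab path, and arguments below are hypothetical:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;
    import org.apache.hadoop.security.UserGroupInformation;

    public class QuotaHelper {
        // args[0] = scratch dir, args[1] = space quota in bytes
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            UserGroupInformation ugi = UserGroupInformation
                .loginUserFromKeytabAndReturnUGI(
                    "hdfs/host@EXAMPLE.COM",              // hypothetical principal
                    "/etc/security/keytabs/hdfs.keytab"); // hypothetical keytab
            ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
                DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(conf);
                dfs.setQuota(new Path(args[0]),
                             HdfsConstants.QUOTA_DONT_SET,
                             Long.parseLong(args[1]));
                return null;
            });
        }
    }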

Thanks,
Ravi

Re: Quota for rogue ad-hoc queries

Posted by ravi teja <ra...@gmail.com>.
Hi Gopal,

We are using MR, not Tez.
Since the ad-hoc queries' output size is something we can determine, rather
than the time the job takes, I was wondering more about a quota on output
size/number of rows.
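
The HDFS_BYTES_WRITTEN counter Gopal mentions (see his reply below) is also
available from a running MR job's AM, so an external watcher can enforce a
bytes-written budget; a minimal sketch, where the 100 GB budget and the
30-second poll interval are assumptions, and counters are only as fresh as
the last task heartbeat:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobID;

    public class OutputBudgetWatcher {
        // args[0] = job id, e.g. job_1472..._0001
        public static void main(String[] args) throws Exception {
            long budgetBytes = 100L * 1024 * 1024 * 1024;  // assumed 100 GB
            Cluster cluster = new Cluster(new Configuration());
            Job job = cluster.getJob(JobID.forName(args[0]));
            while (job != null && !job.isComplete()) {
                long written = job.getCounters()
                    .findCounter("org.apache.hadoop.mapreduce.FileSystemCounter",
                                 "HDFS_BYTES_WRITTEN").getValue();
                if (written > budgetBytes) {
                    job.killJob();  // over budget: stop the rogue job
                    break;
                }
                Thread.sleep(30_000);  // poll every 30 s
            }
        }
    }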

Thanks,
Ravi

Re: Quota for rogue ad-hoc queries

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Are there any other ways?

Are you running Tez?

Tez heartbeats counters back to the AppMaster every few seconds, so the
AppMaster has an accurate (but delayed) count of HDFS_BYTES_WRITTEN.

Cheers,
Gopal

Re: Quota for rogue ad-hoc queries

Posted by ravi teja <ra...@gmail.com>.
Thanks Mich,

Unfortunately we have many insert queries.
Are there any other ways?

Thanks,
Ravi

Re: Quota for rogue ad-hoc queries

Posted by Mich Talebzadeh <mi...@gmail.com>.
Try this

hive.limit.optimize.fetch.max

   - Default Value: 50000
   - Added In: Hive 0.8.0

Maximum number of rows allowed for a smaller subset of data for simple
LIMIT, if it is a fetch query. Insert queries are not restricted by this
limit.
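
A minimal usage sketch over HiveServer2 JDBC, showing the parameter being
set for a session; the host and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class FetchLimitExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hs2-host:10000/default", "user", "");
                 Statement st = conn.createStatement()) {
                // cap simple-LIMIT fetch queries in this session at 50000 rows
                st.execute("SET hive.limit.optimize.fetch.max=50000");
            }
        }
    }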


HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.