Posted to user@impala.apache.org by 廖松博 <li...@gridsum.com> on 2016/07/18 09:59:19 UTC

Impala user resource isolation best practice

Hello guys,

       My company uses Cloudera Impala as the basic infrastructure for online data analysis. The most difficult problems we have met are resource isolation and instability.
In our experience with Impala, a big query that consumes a vast amount of memory can crash the impalad process (acting as a worker, not as the coordinator, right?).
In our simplest scenario, user A is a very important customer whose queries are relatively small, and user B is an unimportant user who may issue very large SQL queries to Impala. It is unacceptable for a big query from user B to crash the impalad process and affect user A's experience, so resource isolation is the key point.
But per the Impala documentation: http://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_admission.html , Impala resource isolation is a soft limit and cannot strictly prevent queries from user B from affecting user A.
As far as I know, Llama (running Impala with YARN) is not recommended; we actually tried it but were disappointed with the performance and accuracy.
       Is there any best practice for user resource isolation, so that different users will not affect each other?
       Thanks.

Best Regards,
Songbo

Re: Impala user resource isolation best practice

Posted by Matthew Jacobs <mj...@cloudera.com>.
By the way, some of the controls I mentioned were added in Impala 2.5,
so you should consider upgrading if you're not already using a newer
version of Impala.

Thanks,
Matt

On Mon, Jul 18, 2016 at 9:20 AM, Matthew Jacobs <mj...@cloudera.com> wrote:
> Hi Songbo,
>
> Right now the best you can do is to use admission control with:
> (a) a single coordinator to avoid the possibility of over-admitting by
> different coordinators
> (b) setting default query mem limits so that individual queries are limited
>
> For your scenario, I'd recommend setting up 2 pools, one for user A
> and a second for user B. Set the max number of running queries for
> user A to something reasonable for the concurrency for that workload.
> Set the max memory for the user B pool to the portion of cluster
> memory you're willing to give to those queries. (Notice that the pool with
> the small queries has the max number of running queries set and the
> pool with the fewer but larger queries has the max memory set --
> that is intentional: the former is faster for admission but doesn't
> limit based on memory.) How well this will work depends on how well
> you can pick good numbers for these settings, which can be difficult
> and requires studying your workload.
>
> This isn't perfect resource isolation because rogue queries can still
> consume too much CPU or other resources, but it's the best you'll be
> able to do right now. In the future we will have better tools to make
> this easier.
>
> Best,
> Matt

Re: Impala user resource isolation best practice

Posted by Matthew Jacobs <mj...@cloudera.com>.
Hi Songbo,

Right now the best you can do is to use admission control with:
(a) a single coordinator to avoid the possibility of over-admitting by
different coordinators
(b) setting default query mem limits so that individual queries are limited
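
To illustrate point (a), here is a minimal Python sketch of why several coordinators can over-admit. It is a toy model, not Impala's actual implementation: the `Coordinator` class, the `simulate` function, and all numbers are hypothetical, and it assumes each coordinator decides from its own view of the cluster, which has not yet been refreshed with the other coordinators' admissions.

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    """Hypothetical coordinator that admits against its last-seen cluster state."""
    name: str
    seen_running: int = 0      # possibly stale count of queries admitted elsewhere
    admitted_locally: int = 0  # queries this coordinator admitted itself

    def try_admit(self, max_running: int) -> bool:
        # Admit only if this coordinator's own view says there is room.
        if self.seen_running + self.admitted_locally < max_running:
            self.admitted_locally += 1
            return True
        return False


def simulate(num_coordinators: int, requests_each: int, max_running: int) -> int:
    """Return the total admitted before any coordinator refreshes its view."""
    coordinators = [Coordinator(f"c{i}") for i in range(num_coordinators)]
    total = 0
    for coordinator in coordinators:
        for _ in range(requests_each):
            if coordinator.try_admit(max_running):
                total += 1
    return total


# Assumed pool limit: at most 4 running queries.
single = simulate(num_coordinators=1, requests_each=10, max_running=4)
multi = simulate(num_coordinators=3, requests_each=10, max_running=4)
print(single, multi)
```

In this model a single coordinator never exceeds the limit, while three coordinators with stale views can each fill the limit on their own, so the cluster runs more queries than the pool allows.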

For your scenario, I'd recommend setting up 2 pools, one for user A
and a second for user B. Set the max number of running queries for
user A to something reasonable for the concurrency for that workload.
Set the max memory for the user B pool to the portion of cluster
memory you're willing to give to those queries. (Notice that the pool with
the small queries has the max number of running queries set and the
pool with the fewer but larger queries has the max memory set --
that is intentional: the former is faster for admission but doesn't
limit based on memory.) How well this will work depends on how well
you can pick good numbers for these settings, which can be difficult
and requires studying your workload.
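
The two-pool setup described above can be sketched as a small Python model. This is only an illustration of the admission logic under stated assumptions, not Impala code or configuration: the `Pool` class, the pool names, and every number (20 running queries, 100 GB, per-query limits) are hypothetical, and the model simply rejects a query where a real system would queue it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

GB = 1024 ** 3


@dataclass
class Pool:
    """Hypothetical admission pool with an optional count limit and memory limit."""
    name: str
    max_running: Optional[int] = None    # limit on concurrently running queries
    max_mem_bytes: Optional[int] = None  # limit on total reserved memory
    default_query_mem_bytes: int = 0     # per-query mem limit applied on admission
    running: Dict[str, int] = field(default_factory=dict)  # query id -> reserved bytes

    def try_admit(self, query_id: str) -> bool:
        # Count-based check: cheap, but ignores how much memory queries use.
        if self.max_running is not None and len(self.running) >= self.max_running:
            return False
        # Memory-based check: sum the per-query limits already reserved.
        reserved = sum(self.running.values())
        if (self.max_mem_bytes is not None
                and reserved + self.default_query_mem_bytes > self.max_mem_bytes):
            return False
        self.running[query_id] = self.default_query_mem_bytes
        return True

    def finish(self, query_id: str) -> None:
        # Release the reservation when the query completes.
        self.running.pop(query_id, None)


# User A: many small queries, limited only by concurrency.
pool_a = Pool("user_a", max_running=20, default_query_mem_bytes=2 * GB)
# User B: few large queries, limited by a share of cluster memory.
pool_b = Pool("user_b", max_mem_bytes=100 * GB, default_query_mem_bytes=40 * GB)

a_results = [pool_a.try_admit(f"a{i}") for i in range(25)]
b_results = [pool_b.try_admit(f"b{i}") for i in range(4)]
print(sum(a_results), b_results)
```

With these assumed numbers, pool A admits 20 of 25 queries, and pool B admits two 40 GB queries and turns away the rest until one finishes. Picking the real values is the hard part and depends on the workload.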

This isn't perfect resource isolation because rogue queries can still
consume too much CPU or other resources, but it's the best you'll be
able to do right now. In the future we will have better tools to make
this easier.

Best,
Matt
