Posted to dev@impala.apache.org by Antoni Ivanov <ai...@vmware.com> on 2019/08/07 17:13:28 UTC

How to parse a query plan /summary/profile

Hi,

We'd like to get better visibility into the way our Impala cluster is used.
For example, there's per-node utilization: sometimes fragments on a given node are slower, and this is visible in the profile. There are also some statistics available only in the profile (like runtime filters used or Parquet file pruning stats).

I think you can download it as Thrift? But is it easily deserializable? (We would need the Thrift schema at least, I think.)

Thanks,
Antoni


Re: How to parse a query plan /summary/profile

Posted by "Jenny Kwan (c)" <kj...@vmware.com>.
Thanks, Tim,

This is helpful. Currently, we (colleagues of Antoni) are using Impyla, which runs the Thrift code generation as part of its egg/wheel building process (I assume). Internally, we'll figure out how to either match the Impyla Thrift version or build Impyla ourselves.

Does the JSON structure not match the nested Thrift structs?

Thanks,
Jenny

From: Tim Armstrong <ta...@cloudera.com>
Date: Friday, August 9, 2019 at 5:20 PM
To: "user@impala.apache.org" <us...@impala.apache.org>
Cc: "dev@impala.apache.org" <de...@impala.apache.org>, "Jenny Kwan (c)" <kj...@vmware.com>
Subject: Re: How to parse a query plan /summary/profile

Impala has two sets of information tracked on the coordinator node for each query: a summary and a profile.
The profile is currently accessible as a string, which is unwieldy for parsing. A thrift format is theoretically available, but there is a bug: https://issues.apache.org/jira/browse/IMPALA-8252 , which is resolved in v3.2.0. So you need to have version >=3.2

The thrift format generally works fine, I know of a lot of tooling built on top of it (e.g. Cloudera Manager uses it extensively). The title of the JIRA sounds overly dramatic without context, basically we had some issues with compatibility across versions. You'll be fine if you use the .thrift file corresponding to the version of Impala you're consuming profiles from. It's messier if you have a tool that uses an old thrift file, since there were some issues with backward compatibility, or if you're trying to consume profiles from multiple versions of Impala.

There's a toy Python profile decoder in the impala source tree that may be useful to get started - https://github.com/apache/impala/blob/master/bin/parse-thrift-profile.py and https://github.com/apache/impala/blob/24eab713a0d35f629509f59711f8a563e1346acf/lib/python/impala_py_lib/profiles.py . That just gets you from the base64-encoded strings to a thrift object.

A JSON format was added very recently (this week) into master - https://gerrit.cloudera.org/#/c/13801/ . That's kinda experimental at the moment - we're not sure how convenient the current structure is without some experience actually using it - we'd welcome feedback about your use cases.

- Tim


On Fri, Aug 9, 2019 at 4:14 PM Antoni Ivanov <ai...@vmware.com> wrote:
Hi,

We did some research on the topic; the answer we've come to so far is:

Impala has two sets of information tracked on the coordinator node for each query: a summary and a profile.
The profile is currently accessible as a string, which is unwieldy for parsing. A thrift format is theoretically available, but there is a bug: https://issues.apache.org/jira/browse/IMPALA-8252 , which is resolved in v3.2.0. So you need to have version >=3.2


After that, the Thrift JSON encoder from Twitter commons may be used:
https://github.com/twitter/commons/blob/06905dc0f1a26440a79ff1164831c85ce2d1bdf0/src/python/twitter/thrift/text/thrift_json_encoder.py


The thrift profile can be downloaded from the coordinator node, e.g. http://coord-node:25000/query_profile_encoded?query_id=442c057197d9c0d:81810ccd00000000 (442c057197d9c0d:81810ccd00000000 is the query ID).
It can also be downloaded from the Cloudera REST API (if using Cloudera).
Or, if using the impyla <https://github.com/cloudera/impyla> Python library, you can get the profile after execution:
        cur.execute(sql)
        return cur.get_profile(profile_format=TRuntimeProfileFormat.THRIFT)


Just posting here in case it's helpful to anyone following the user group.

-Antoni

From: Antoni Ivanov
Sent: Wednesday, August 7, 2019 10:13 AM
To: user@impala.apache.org
Cc: dev@impala <de...@impala.apache.org>; Jenny Kwan (c) <kj...@vmware.com>
Subject: How to parse a query plan /summary/profile

Hi,

We'd like to get better visibility into the way our Impala cluster is used.
For example, there's per-node utilization: sometimes fragments on a given node are slower, and this is visible in the profile. There are also some statistics available only in the profile (like runtime filters used or Parquet file pruning stats).

I think you can download it as Thrift? But is it easily deserializable? (We would need the Thrift schema at least, I think.)
Thanks,
Antoni


Re: How to parse a query plan /summary/profile

Posted by Tim Armstrong <ta...@cloudera.com>.
>
> Impala has two sets of information tracked on the coordinator node for
> each query: a summary and a profile.
> The profile is currently accessible as a string, which is unwieldy for
> parsing. A thrift format is theoretically available, but there is a bug:
> https://issues.apache.org/jira/browse/IMPALA-8252 , which is resolved in
> v3.2.0. So you need to have version >=3.2


The thrift format generally works fine, I know of a lot of tooling built on
top of it (e.g. Cloudera Manager uses it extensively). The title of the
JIRA sounds overly dramatic without context, basically we had some issues
with compatibility across versions. You'll be fine if you use the .thrift
file corresponding to the version of Impala you're consuming profiles from.
It's messier if you have a tool that uses an old thrift file, since there
were some issues with backward compatibility, or if you're trying to
consume profiles from multiple versions of Impala.

There's a toy Python profile decoder in the impala source tree that may be
useful to get started -
https://github.com/apache/impala/blob/master/bin/parse-thrift-profile.py
and
https://github.com/apache/impala/blob/24eab713a0d35f629509f59711f8a563e1346acf/lib/python/impala_py_lib/profiles.py .
That just gets you from the base64-encoded strings to a thrift object.
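
For anyone following along, here is a minimal sketch of that first decoding step using only the Python standard library. It assumes the encoded profile is base64 text wrapping a zlib-compressed Thrift payload, which is how the linked script appears to treat it; the function name decode_profile_payload is made up for illustration. Turning the resulting bytes into an object still needs classes generated from the RuntimeProfile.thrift file of your Impala version, so that part is only outlined in comments.

```python
import base64
import zlib


def decode_profile_payload(encoded: str) -> bytes:
    """Return the raw serialized Thrift bytes from a base64-encoded profile.

    Assumption: the payload is base64(zlib(thrift_bytes)).
    """
    compressed = base64.b64decode(encoded.strip())
    return zlib.decompress(compressed)


# Deserializing the bytes requires Thrift-generated classes that match the
# Impala version. With the Apache Thrift Python library it would look roughly
# like this (module path of the generated code is an assumption):
#
#   from thrift.TSerialization import deserialize
#   from thrift.protocol.TCompactProtocol import TCompactProtocolFactory
#   from RuntimeProfile.ttypes import TRuntimeProfileTree
#
#   raw = decode_profile_payload(encoded_text)
#   tree = deserialize(TRuntimeProfileTree(), raw, TCompactProtocolFactory())
```

Keeping the base64/zlib step separate from the Thrift step makes it easy to swap in generated classes for a different Impala version without touching the rest.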

A JSON format was added very recently (this week) into master -
https://gerrit.cloudera.org/#/c/13801/. That's kinda experimental at the
moment - we're not sure how convenient the current structure is without
some experience actually using it - we'd welcome feedback about your use
cases.

- Tim




RE: How to parse a query plan /summary/profile

Posted by Antoni Ivanov <ai...@vmware.com>.
Hi,

We did some research on the topic; the answer we've come to so far is:

Impala has two sets of information tracked on the coordinator node for each query: a summary and a profile.
The profile is currently accessible as a string, which is unwieldy for parsing. A thrift format is theoretically available, but there is a bug: https://issues.apache.org/jira/browse/IMPALA-8252 , which is resolved in v3.2.0. So you need to have version >=3.2


After that, the Thrift JSON encoder from Twitter commons may be used:
https://github.com/twitter/commons/blob/06905dc0f1a26440a79ff1164831c85ce2d1bdf0/src/python/twitter/thrift/text/thrift_json_encoder.py


The thrift profile can be downloaded from the coordinator node, e.g. http://coord-node:25000/query_profile_encoded?query_id=442c057197d9c0d:81810ccd00000000 (442c057197d9c0d:81810ccd00000000 is the query ID).
It can also be downloaded from the Cloudera REST API (if using Cloudera).
Or, if using the impyla <https://github.com/cloudera/impyla> Python library, you can get the profile after execution:
        cur.execute(sql)
        return cur.get_profile(profile_format=TRuntimeProfileFormat.THRIFT)
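
For the coordinator option, a rough sketch of the download could look like the following. It only builds the URL shown above and reads the response body; the helper names, the default port argument, and the timeout are illustrative choices, and it assumes the debug web UI is reachable over plain HTTP without authentication.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def profile_url(coordinator: str, query_id: str, port: int = 25000) -> str:
    """Build the /query_profile_encoded URL for one query on a coordinator."""
    # Keep ":" unescaped so the URL matches the form used in the web UI.
    query = urlencode({"query_id": query_id}, safe=":")
    return f"http://{coordinator}:{port}/query_profile_encoded?{query}"


def fetch_encoded_profile(coordinator: str, query_id: str,
                          timeout: float = 10.0) -> str:
    """Download the encoded profile text (assumes no auth or TLS is needed)."""
    with urlopen(profile_url(coordinator, query_id), timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()
```

For example, profile_url("coord-node", "442c057197d9c0d:81810ccd00000000") gives back the URL quoted above.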


Just posting here in case it's helpful to anyone following the user group.

-Antoni


