You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Bharath Vissapragada <bh...@cloudera.com> on 2017/12/08 18:00:14 UTC

Re: Questions about Statestore and Catalogservice

Looks like a topic for dev@.

On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <la...@gmail.com> wrote:

> Hi,
>
> I'm trying to understand how the communication between the components
> works.
>
> I understand that an impala daemon subscribes to the statestore. The
> statestore seems to have the concept of heartbeats and topics. But I'm not
> sure what topics are all about.
>

Statestore follows the standard pub-sub pattern
<https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern> where a
publisher publishes messages and subscribers subscribe to the
messages/categories they are interested in.  Like you mentioned, statestore
is like a mediator between the publishers and the subscribers.

"Topic" is an abstraction that makes the content of these messages opaque
to the statestore. The publishers (like Catalog server for example)
serialize the messages (metadata for example) into a "Topic" to ship them
to the statestore which then broadcasts that to the interested subscribers
(coordinators). The coordinators then unpack/deserialize the topic into the
corresponding object classes (like Tables/Functions etc.) and apply those
updates locally.

In Impala, currently we have the following topics:

catalog-update - For Catalog metadata
impala-membership - For tracking liveness of the coordinators/executors
impala-request-queue -  For admission control

You can see these in the statestore web UI (/topics page)

>
>
> The docs also say that only the statestore communicates with the catalog
> service. How does that happen?
>

Can you point us to which doc you are referring to here?

Techincally speaking, the coordinators also connect to the Catalog service
for executing DDLs, but I'm assuming you are speaking here in terms of the
broadcast of the table updates, in which case Catalog sends those tables to
the statestore (as a part of catalog-update topic) and those are broadcast
by the statestore to all the coordinators. (described above)

How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog
service and back?

I'll take the example of REFRESH here.  The metadata flow looks something
like this

- coordinator 'coo' gets 'refresh foo'
- 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
- 'cat' refreshes the table 'foo', which changes the version of 'foo' from
v1 to v2 (Internally Catalog versions all the objects to track which
objects changed over time)
- 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
result of RPC) which then applies the update locally.
- Additionally 'cat' also has a thread running in the background that
figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
into a "Topic" update and sends it to the statestore.
- Statestore then broadcasts the new updates to all the coordinators.

INVALIDATE is slightly different in the sense that the coordinator doesn't
get foo(v2) back as the result of the rpc, instead it gets an
"IncompleteTable" (Impala terminology) which means that the table is either
missing the catalog metadata/it has been invalidated.

There are many minor details on how the entire system works but "most"
Catalog updates work as above (with some exceptions).


> I'm sure I'll have follow-up questions but this would already be very
> helpful. Thank you!
>

Sure, feel free to ask the list. Here are some code pointers incase you are
interested.

https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h
(Topic/TopicEntry and other SS  abstractions)

https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45
(thrift definitions for most Catalog operations)

https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324
(how coordinators apply the Catalog updates)

https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187
(An example of how "catalog-update" is created & subscribed)

HTH.


> Cheers,
> Lars
>

Re: Questions about Statestore and Catalogservice

Posted by Jeszy <je...@gmail.com>.
Thanks for pointing out the docs issue! I opened IMPALA-6303 to track it.

On 10 December 2017 at 15:47, Lars Francke <la...@gmail.com> wrote:
> Thank you Bharath & Dimitris!
>
> That answers all the questions I have right now, thank you so much for
> taking the time to write it up.
>
> Regarding the docs:
> <https://impala.apache.org/docs/build/html/topics/impala_components.html>
>
>> The Impala component known as the catalog service relays the metadata
>> changes from Impala SQL statements to all the DataNodes in a cluster. It is
>> physically represented by a daemon process named catalogd; you only need
>> such a process on one host in the cluster. Because the requests are passed
>> through the statestore daemon, it makes sense to run the statestored and
>> catalogd services on the same host.
>
> Reading it again now it also says "DataNodes" which is not correct.
>
> Cheers,
> Lars
>
>
> On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <bh...@cloudera.com>
> wrote:
>>
>> Looks like a topic for dev@.
>>
>> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <la...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to understand how the communication between the components
>>> works.
>>>
>>> I understand that an impala daemon subscribes to the statestore. The
>>> statestore seems to have the concept of heartbeats and topics. But I'm not
>>> sure what topics are all about.
>>
>>
>> Statestore follows the standard pub-sub pattern where a publisher
>> publishes messages and subscribers subscribe to the messages/categories they
>> are interested in.  Like you mentioned, statestore is like a mediator
>> between the publishers and the subscribers.
>>
>> "Topic" is an abstraction that makes the content of these messages opaque
>> to the statestore. The publishers (like Catalog server for example)
>> serialize the messages (metadata for example) into a "Topic" to ship them to
>> the statestore which then broadcasts that to the interested subscribers
>> (coordinators). The coordinators then unpack/deserialize the topic into the
>> corresponding object classes (like Tables/Functions etc.) and apply those
>> updates locally.
>>
>> In Impala, currently we have the following topics:
>>
>> catalog-update - For Catalog metadata
>> impala-membership - For tracking liveness of the coordinators/executors
>> impala-request-queue -  For admission control
>>
>> You can see these in the statestore web UI (/topics page)
>>
>>>
>>>
>>> The docs also say that only the statestore communicates with the catalog
>>> service. How does that happen?
>>
>>
>> Can you point us to which doc you are referring to here?
>>
>> Techincally speaking, the coordinators also connect to the Catalog service
>> for executing DDLs, but I'm assuming you are speaking here in terms of the
>> broadcast of the table updates, in which case Catalog sends those tables to
>> the statestore (as a part of catalog-update topic) and those are broadcast
>> by the statestore to all the coordinators. (described above)
>>
>> How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog
>> service and back?
>>
>> I'll take the example of REFRESH here.  The metadata flow looks something
>> like this
>>
>> - coordinator 'coo' gets 'refresh foo'
>> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
>> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
>> v1 to v2 (Internally Catalog versions all the objects to track which objects
>> changed over time)
>> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
>> result of RPC) which then applies the update locally.
>> - Additionally 'cat' also has a thread running in the background that
>> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
>> into a "Topic" update and sends it to the statestore.
>> - Statestore then broadcasts the new updates to all the coordinators.
>>
>> INVALIDATE is slightly different in the sense that the coordinator doesn't
>> get foo(v2) back as the result of the rpc, instead it gets an
>> "IncompleteTable" (Impala terminology) which means that the table is either
>> missing the catalog metadata/it has been invalidated.
>>
>> There are many minor details on how the entire system works but "most"
>> Catalog updates work as above (with some exceptions).
>>
>>>
>>> I'm sure I'll have follow-up questions but this would already be very
>>> helpful. Thank you!
>>
>>
>> Sure, feel free to ask the list. Here are some code pointers incase you
>> are interested.
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h
>> (Topic/TopicEntry and other SS  abstractions)
>>
>>
>> https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45
>> (thrift definitions for most Catalog operations)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324
>> (how coordinators apply the Catalog updates)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187
>> (An example of how "catalog-update" is created & subscribed)
>>
>> HTH.
>>
>>>
>>> Cheers,
>>> Lars
>>
>>
>

Re: Questions about Statestore and Catalogservice

Posted by Jeszy <je...@gmail.com>.
Thanks for pointing out the docs issue! I opened IMPALA-6303 to track it.

On 10 December 2017 at 15:47, Lars Francke <la...@gmail.com> wrote:
> Thank you Bharath & Dimitris!
>
> That answers all the questions I have right now, thank you so much for
> taking the time to write it up.
>
> Regarding the docs:
> <https://impala.apache.org/docs/build/html/topics/impala_components.html>
>
>> The Impala component known as the catalog service relays the metadata
>> changes from Impala SQL statements to all the DataNodes in a cluster. It is
>> physically represented by a daemon process named catalogd; you only need
>> such a process on one host in the cluster. Because the requests are passed
>> through the statestore daemon, it makes sense to run the statestored and
>> catalogd services on the same host.
>
> Reading it again now it also says "DataNodes" which is not correct.
>
> Cheers,
> Lars
>
>
> On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <bh...@cloudera.com>
> wrote:
>>
>> Looks like a topic for dev@.
>>
>> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <la...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to understand how the communication between the components
>>> works.
>>>
>>> I understand that an impala daemon subscribes to the statestore. The
>>> statestore seems to have the concept of heartbeats and topics. But I'm not
>>> sure what topics are all about.
>>
>>
>> Statestore follows the standard pub-sub pattern where a publisher
>> publishes messages and subscribers subscribe to the messages/categories they
>> are interested in.  Like you mentioned, statestore is like a mediator
>> between the publishers and the subscribers.
>>
>> "Topic" is an abstraction that makes the content of these messages opaque
>> to the statestore. The publishers (like Catalog server for example)
>> serialize the messages (metadata for example) into a "Topic" to ship them to
>> the statestore which then broadcasts that to the interested subscribers
>> (coordinators). The coordinators then unpack/deserialize the topic into the
>> corresponding object classes (like Tables/Functions etc.) and apply those
>> updates locally.
>>
>> In Impala, currently we have the following topics:
>>
>> catalog-update - For Catalog metadata
>> impala-membership - For tracking liveness of the coordinators/executors
>> impala-request-queue -  For admission control
>>
>> You can see these in the statestore web UI (/topics page)
>>
>>>
>>>
>>> The docs also say that only the statestore communicates with the catalog
>>> service. How does that happen?
>>
>>
>> Can you point us to which doc you are referring to here?
>>
>> Techincally speaking, the coordinators also connect to the Catalog service
>> for executing DDLs, but I'm assuming you are speaking here in terms of the
>> broadcast of the table updates, in which case Catalog sends those tables to
>> the statestore (as a part of catalog-update topic) and those are broadcast
>> by the statestore to all the coordinators. (described above)
>>
>> How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog
>> service and back?
>>
>> I'll take the example of REFRESH here.  The metadata flow looks something
>> like this
>>
>> - coordinator 'coo' gets 'refresh foo'
>> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
>> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
>> v1 to v2 (Internally Catalog versions all the objects to track which objects
>> changed over time)
>> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
>> result of RPC) which then applies the update locally.
>> - Additionally 'cat' also has a thread running in the background that
>> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
>> into a "Topic" update and sends it to the statestore.
>> - Statestore then broadcasts the new updates to all the coordinators.
>>
>> INVALIDATE is slightly different in the sense that the coordinator doesn't
>> get foo(v2) back as the result of the rpc, instead it gets an
>> "IncompleteTable" (Impala terminology) which means that the table is either
>> missing the catalog metadata/it has been invalidated.
>>
>> There are many minor details on how the entire system works but "most"
>> Catalog updates work as above (with some exceptions).
>>
>>>
>>> I'm sure I'll have follow-up questions but this would already be very
>>> helpful. Thank you!
>>
>>
>> Sure, feel free to ask the list. Here are some code pointers incase you
>> are interested.
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h
>> (Topic/TopicEntry and other SS  abstractions)
>>
>>
>> https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45
>> (thrift definitions for most Catalog operations)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324
>> (how coordinators apply the Catalog updates)
>>
>>
>> https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187
>> (An example of how "catalog-update" is created & subscribed)
>>
>> HTH.
>>
>>>
>>> Cheers,
>>> Lars
>>
>>
>

Re: Questions about Statestore and Catalogservice

Posted by Lars Francke <la...@gmail.com>.
Thank you Bharath & Dimitris!

That answers all the questions I have right now, thank you so much for
taking the time to write it up.

Regarding the docs: <
https://impala.apache.org/docs/build/html/topics/impala_components.html>

> The Impala component known as the catalog service relays the metadata
changes from Impala SQL statements to all the DataNodes in a cluster. It is
physically represented by a daemon process named catalogd; you only need
such a process on one host in the cluster. Because the requests are passed
through the statestore daemon, it makes sense to run the statestored and
catalogd services on the same host.

Reading it again now it also says "DataNodes" which is not correct.

Cheers,
Lars


On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <bh...@cloudera.com>
wrote:

> Looks like a topic for dev@.
>
> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <la...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to understand how the communication between the components
>> works.
>>
>> I understand that an impala daemon subscribes to the statestore. The
>> statestore seems to have the concept of heartbeats and topics. But I'm not
>> sure what topics are all about.
>>
>
> Statestore follows the standard pub-sub pattern
> <https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern> where a
> publisher publishes messages and subscribers subscribe to the
> messages/categories they are interested in.  Like you mentioned, statestore
> is like a mediator between the publishers and the subscribers.
>
> "Topic" is an abstraction that makes the content of these messages opaque
> to the statestore. The publishers (like Catalog server for example)
> serialize the messages (metadata for example) into a "Topic" to ship them
> to the statestore which then broadcasts that to the interested subscribers
> (coordinators). The coordinators then unpack/deserialize the topic into the
> corresponding object classes (like Tables/Functions etc.) and apply those
> updates locally.
>
> In Impala, currently we have the following topics:
>
> catalog-update - For Catalog metadata
> impala-membership - For tracking liveness of the coordinators/executors
> impala-request-queue -  For admission control
>
> You can see these in the statestore web UI (/topics page)
>
>>
>>
>> The docs also say that only the statestore communicates with the catalog
>> service. How does that happen?
>>
>
> Can you point us to which doc you are referring to here?
>
> Techincally speaking, the coordinators also connect to the Catalog service
> for executing DDLs, but I'm assuming you are speaking here in terms of the
> broadcast of the table updates, in which case Catalog sends those tables to
> the statestore (as a part of catalog-update topic) and those are broadcast
> by the statestore to all the coordinators. (described above)
>
> How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog
> service and back?
>
> I'll take the example of REFRESH here.  The metadata flow looks something
> like this
>
> - coordinator 'coo' gets 'refresh foo'
> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
> v1 to v2 (Internally Catalog versions all the objects to track which
> objects changed over time)
> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
> result of RPC) which then applies the update locally.
> - Additionally 'cat' also has a thread running in the background that
> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
> into a "Topic" update and sends it to the statestore.
> - Statestore then broadcasts the new updates to all the coordinators.
>
> INVALIDATE is slightly different in the sense that the coordinator doesn't
> get foo(v2) back as the result of the rpc, instead it gets an
> "IncompleteTable" (Impala terminology) which means that the table is either
> missing the catalog metadata/it has been invalidated.
>
> There are many minor details on how the entire system works but "most"
> Catalog updates work as above (with some exceptions).
>
>
>> I'm sure I'll have follow-up questions but this would already be very
>> helpful. Thank you!
>>
>
> Sure, feel free to ask the list. Here are some code pointers incase you
> are interested.
>
> https://github.com/apache/impala/blob/master/be/src/
> statestore/statestore.h (Topic/TopicEntry and other SS  abstractions)
>
> https://github.com/apache/impala/blob/master/common/
> thrift/CatalogService.thrift#L45 (thrift definitions for most Catalog
> operations)
>
> https://github.com/apache/impala/blob/master/be/src/
> service/impala-server.h#L324 (how coordinators apply the Catalog updates)
>
> https://github.com/apache/impala/blob/master/be/src/
> catalog/catalog-server.cc#L187 (An example of how "catalog-update" is
> created & subscribed)
>
> HTH.
>
>
>> Cheers,
>> Lars
>>
>
>

Re: Questions about Statestore and Catalogservice

Posted by Lars Francke <la...@gmail.com>.
Thank you Bharath & Dimitris!

That answers all the questions I have right now, thank you so much for
taking the time to write it up.

Regarding the docs: <
https://impala.apache.org/docs/build/html/topics/impala_components.html>

> The Impala component known as the catalog service relays the metadata
changes from Impala SQL statements to all the DataNodes in a cluster. It is
physically represented by a daemon process named catalogd; you only need
such a process on one host in the cluster. Because the requests are passed
through the statestore daemon, it makes sense to run the statestored and
catalogd services on the same host.

Reading it again now it also says "DataNodes" which is not correct.

Cheers,
Lars


On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <bh...@cloudera.com>
wrote:

> Looks like a topic for dev@.
>
> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <la...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to understand how the communication between the components
>> works.
>>
>> I understand that an impala daemon subscribes to the statestore. The
>> statestore seems to have the concept of heartbeats and topics. But I'm not
>> sure what topics are all about.
>>
>
> Statestore follows the standard pub-sub pattern
> <https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern> where a
> publisher publishes messages and subscribers subscribe to the
> messages/categories they are interested in.  Like you mentioned, statestore
> is like a mediator between the publishers and the subscribers.
>
> "Topic" is an abstraction that makes the content of these messages opaque
> to the statestore. The publishers (like Catalog server for example)
> serialize the messages (metadata for example) into a "Topic" to ship them
> to the statestore which then broadcasts that to the interested subscribers
> (coordinators). The coordinators then unpack/deserialize the topic into the
> corresponding object classes (like Tables/Functions etc.) and apply those
> updates locally.
>
> In Impala, currently we have the following topics:
>
> catalog-update - For Catalog metadata
> impala-membership - For tracking liveness of the coordinators/executors
> impala-request-queue -  For admission control
>
> You can see these in the statestore web UI (/topics page)
>
>>
>>
>> The docs also say that only the statestore communicates with the catalog
>> service. How does that happen?
>>
>
> Can you point us to which doc you are referring to here?
>
> Techincally speaking, the coordinators also connect to the Catalog service
> for executing DDLs, but I'm assuming you are speaking here in terms of the
> broadcast of the table updates, in which case Catalog sends those tables to
> the statestore (as a part of catalog-update topic) and those are broadcast
> by the statestore to all the coordinators. (described above)
>
> How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog
> service and back?
>
> I'll take the example of REFRESH here.  The metadata flow looks something
> like this
>
> - coordinator 'coo' gets 'refresh foo'
> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh'
> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from
> v1 to v2 (Internally Catalog versions all the objects to track which
> objects changed over time)
> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo'  (as the
> result of RPC) which then applies the update locally.
> - Additionally 'cat' also has a thread running in the background that
> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo'
> into a "Topic" update and sends it to the statestore.
> - Statestore then broadcasts the new updates to all the coordinators.
>
> INVALIDATE is slightly different in the sense that the coordinator doesn't
> get foo(v2) back as the result of the rpc, instead it gets an
> "IncompleteTable" (Impala terminology) which means that the table is either
> missing the catalog metadata/it has been invalidated.
>
> There are many minor details on how the entire system works but "most"
> Catalog updates work as above (with some exceptions).
>
>
>> I'm sure I'll have follow-up questions but this would already be very
>> helpful. Thank you!
>>
>
> Sure, feel free to ask the list. Here are some code pointers incase you
> are interested.
>
> https://github.com/apache/impala/blob/master/be/src/
> statestore/statestore.h (Topic/TopicEntry and other SS  abstractions)
>
> https://github.com/apache/impala/blob/master/common/
> thrift/CatalogService.thrift#L45 (thrift definitions for most Catalog
> operations)
>
> https://github.com/apache/impala/blob/master/be/src/
> service/impala-server.h#L324 (how coordinators apply the Catalog updates)
>
> https://github.com/apache/impala/blob/master/be/src/
> catalog/catalog-server.cc#L187 (An example of how "catalog-update" is
> created & subscribed)
>
> HTH.
>
>
>> Cheers,
>> Lars
>>
>
>