Posted to dev@impala.apache.org by Todd Lipcon <to...@cloudera.com.INVALID> on 2018/08/07 07:33:59 UTC

Update on catalog changes

Hi folks,

It's been a few weeks since the last update on this topic, so wanted to
check in and share the progress as well as some plans for the coming weeks.

As far as progress is concerned, most of the work so far committed (or
almost-committed) has been some extensive refactoring. You may notice that
the majority of the Frontend now refers to catalog objects by interfaces
such as FeTable, FeDb, FeFsTable, etc, rather than the concrete classes
Table, Db, HdfsTable, etc. In addition, the catalog itself uses an
interface instead of a specific implementation. A new impalad flag
--use_local_catalog can be flipped on to switch in the "new" impalad
catalog implementation.
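To make the shape of that refactoring concrete, here is a toy sketch in Java. The interface and class names echo the ones above, but the method set and the flag-driven selection are invented for illustration; this is not Impala's actual code.

```java
// Illustrative sketch of the interface extraction: planner code depends on
// FeTable, and the concrete class is chosen behind a flag. Names beyond
// FeTable/Table are hypothetical.
public class CatalogInterfaceSketch {

    // Planner-facing view of a table, independent of where metadata came from.
    interface FeTable {
        String getFullName();
        long getCatalogVersion();
    }

    // Existing implementation: metadata pushed eagerly via the statestore.
    static class Table implements FeTable {
        private final String fullName;
        private final long version;
        Table(String fullName, long version) {
            this.fullName = fullName;
            this.version = version;
        }
        public String getFullName() { return fullName; }
        public long getCatalogVersion() { return version; }
    }

    // "New" implementation selected by --use_local_catalog: metadata is
    // fetched on demand rather than pre-loaded.
    static class LocalTable implements FeTable {
        private final String fullName;
        private final long version;
        LocalTable(String fullName, long version) {
            this.fullName = fullName;
            this.version = version;
        }
        public String getFullName() { return fullName; }
        public long getCatalogVersion() { return version; }
    }

    // Because call sites depend only on FeTable, the implementation can be
    // swapped at startup without touching planner code.
    static FeTable loadTable(boolean useLocalCatalog, String name, long version) {
        return useLocalCatalog ? new LocalTable(name, version)
                               : new Table(name, version);
    }

    public static void main(String[] args) {
        FeTable t = loadTable(true, "functional.alltypes", 123);
        System.out.println(t.getFullName() + " @ v" + t.getCatalogVersion());
    }
}
```

The point of the pattern is that the planner never names a concrete class, so the flag flip is invisible to the rest of the frontend.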

One notable exception to the above refactoring is DDL statements such as
CREATE/DROP/ALTER. As we've worked on the project it's become clear that
the DDL implementation is quite tightly interwoven with the concrete
implementations since most of the operations make some attempt to
incrementally modify catalog objects in-place to reflect the changes being
made in external source systems. Additionally, the treatment of the catalog
versioning and metadata propagation protocol is pretty delicate. Given
that, we made a decision a few weeks back to continue to delegate DDL
functions to the catalogd by the existing mechanisms.

This turned up an interesting new set of problems. Specifically, if the
impalad is fetching metadata directly from source systems, but the catalogd
continues to operate with its "snapshotted" table-granularity caches of
objects, it's possible (and even likely) that an impalad will have a
*newer* view of metadata than the catalogd. This can result in very
confusing scenarios, like:

1) a user creates a table through some external system like Hive or Spark
2) a user is able to see and query the table on the impalad, since it
fetched the table metadata direct from HMS
3) the user tries to drop the table from Impala. When the impalad sends the
request to the catalogd, the catalogd responds with a "table does not exist"
error.
4) the user, confused, may run "show tables" again, and will still see the
table listed.

At this point the only way to resolve the inconsistency would be some set
of invalidate queries. There are many other similar scenarios where a user
may get unexpected results or errors from DDLs based on the catalogd having
an older cached copy than the impalad is showing.

One way to fix these issues would be to fully re-implement all of the DDL
statements without the tight interleaving with catalogd code. However, that
will take a certain amount of time and bring more risk due to the
complexity of those code paths.

Additionally, during the design phase there were various concerns raised
around the "fetch directly from source systems" approach, including:
1) increased load on source systems
2) potential for small semantic changes, e.g. breaking users who rely on
REFRESH "freezing" the view of a table
3) loss of various fringe features such as non-persistent functions, which
currently are _only_ saved in catalogd memory
4) potentially more difficult to later implement "subscribe to source
system" type functionality where new files or objects are discovered
automatically

Given the above, plus the risks/effort of re-implementing DDL operations
decoupled from catalogd, I'm currently proposing that we shift the design
of the project a bit. Namely, instead of being "fetch granular metadata
on-demand from source systems with no catalogd", it will become "fetch
granular metadata on-demand from catalogd". The vast majority of the
refactoring work detailed at the top of this email is still relevant: the
planner still needs to operate on lighter-weight objects with different
properties than the catalog, and the machinery to switch around the
behavior still makes sense. It's just that the actual _source_ of metadata
will be a new RPC on the catalogd which allows on-demand granular metadata
retrieval.
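As a rough illustration of what such an RPC might look like, here is a toy request/response sketch in Java. The names (PartialTableRequest, getPartialObject, FakeCatalogd) and fields are invented for this email-level discussion, not Impala's real Thrift definitions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of an on-demand granular-metadata RPC: the caller names
// exactly the pieces it needs (here, one partition of one table), and the
// response carries the catalog version so staleness can be detected.
public class PartialFetchSketch {

    static class PartialTableRequest {
        final String tableName;
        final long partitionId;  // fetch just this partition, not the whole table
        PartialTableRequest(String tableName, long partitionId) {
            this.tableName = tableName;
            this.partitionId = partitionId;
        }
    }

    static class PartialTableResponse {
        final long catalogVersion;       // lets the caller pin a consistent version
        final String partitionMetadata;  // stand-in for the real payload
        PartialTableResponse(long version, String metadata) {
            this.catalogVersion = version;
            this.partitionMetadata = metadata;
        }
    }

    // Stand-in for the catalogd side: it already holds loaded metadata and
    // answers only for the pieces that were asked for.
    static class FakeCatalogd {
        private final Map<String, Map<Long, String>> tables = new HashMap<>();
        private long version = 0;

        void update(String table, long partitionId, String metadata) {
            tables.computeIfAbsent(table, k -> new HashMap<>())
                  .put(partitionId, metadata);
            version++;  // every change bumps the catalog version
        }

        PartialTableResponse getPartialObject(PartialTableRequest req) {
            String meta = tables
                .getOrDefault(req.tableName, new HashMap<>())
                .get(req.partitionId);
            return new PartialTableResponse(version, meta);
        }
    }

    public static void main(String[] args) {
        FakeCatalogd catalogd = new FakeCatalogd();
        catalogd.update("functional.alltypes", 7L, "year=2018/month=8");
        PartialTableResponse resp = catalogd.getPartialObject(
            new PartialTableRequest("functional.alltypes", 7L));
        System.out.println("v" + resp.catalogVersion + ": " + resp.partitionMetadata);
    }
}
```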

A few other key points of this design are:

- the impalads, when they fetch metadata, can note the version number of
the catalog object that they fetched. This allows them to be sure that they
will never get "read skew" in their cache -- all pieces of metadata for a
given table need to have been read from the same version. Invalidation is
also much easier with these consistent version numbers.
- the catalogd, instead of broadcasting full catalog objects through the
statestore, now just needs to send out notices of object version number
changes. For example "table:functional.alltypes -> version 123". This
message can cause the impalads to invalidate the cache for that table for
any version less than 123, and all future queries are sure to re-fetch the
latest data on-demand.
- ideally we can make partition objects immutable, so that, given a
particular partition ID, we know that that ID will never be reused with
different data. On a REFRESH or other partition metadata change, a new ID
can be assigned to the partition. This allows us to safely cache partitions
(the largest piece of metadata) with long TTLs, and to cheaply update the
cache when a change affects only a subset of partitions.
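The version-notice mechanism in the second bullet can be sketched in a few lines of Java. This is a minimal model, with invented class and method names, of an impalad-side cache reacting to "table -> version" notices; it is not the real statestore protocol.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of version-based invalidation: the statestore delivers only
// "table -> version" notices, and the impalad drops its cached copy whenever
// the cached version is older than the notice. Names are illustrative.
public class VersionedCacheSketch {

    static class CachedTable {
        final long version;
        final String metadata;  // stand-in for the real table metadata
        CachedTable(long version, String metadata) {
            this.version = version;
            this.metadata = metadata;
        }
    }

    static class ImpaladCache {
        private final Map<String, CachedTable> cache = new HashMap<>();

        void put(String table, long version, String metadata) {
            cache.put(table, new CachedTable(version, metadata));
        }

        // Handle a notice like "table:functional.alltypes -> version 123":
        // invalidate only entries older than the noticed version.
        void onVersionNotice(String table, long noticedVersion) {
            CachedTable cached = cache.get(table);
            if (cached != null && cached.version < noticedVersion) {
                cache.remove(table);  // next query re-fetches from catalogd
            }
        }

        boolean isCached(String table) { return cache.containsKey(table); }
    }

    public static void main(String[] args) {
        ImpaladCache cache = new ImpaladCache();
        cache.put("functional.alltypes", 120, "schema+partitions@v120");

        cache.onVersionNotice("functional.alltypes", 123);  // stale: dropped
        System.out.println(cache.isCached("functional.alltypes"));  // false

        cache.put("functional.alltypes", 123, "schema+partitions@v123");
        cache.onVersionNotice("functional.alltypes", 123);  // current: kept
        System.out.println(cache.isCached("functional.alltypes"));  // true
    }
}
```

Note that because notices carry only a name and a version number, they stay tiny regardless of table size, which is where the scalability win over broadcasting full catalog objects comes from.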

In terms of the original "fetch from source systems" design, I do continue
to think it has promise longer-term as it simplifies the overall
architecture and can help with cross-system metadata consistency. But we
can treat fetch-from-catalogd as a nice interim step that should bring most of
the performance and scalability benefits to users sooner and with less risk.

I'll plan to update the original design document to reflect this in coming
days.

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Update on catalog changes

Posted by Philip Zeyliger <ph...@cloudera.com.INVALID>.
Thanks for the update, Todd!

-- Philip
