Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/06/30 12:47:00 UTC

[jira] [Commented] (KUDU-3135) Add Client Metadata Tokens

    [ https://issues.apache.org/jira/browse/KUDU-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148629#comment-17148629 ] 

ASF subversion and git services commented on KUDU-3135:
-------------------------------------------------------

Commit d23ee5d38ddc4317f431dd65df0c825c00cc968a in kudu's branch refs/heads/master from Grant Henke
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=d23ee5d ]

KUDU-1802: Avoid calls to master when using scan tokens

This patch adds new metadata to the scan token to allow it
to contain all of the metadata required to construct a KuduTable
and open a scanner in the clients. This means the GetTableSchema
and GetTableLocations RPC calls to the master are no longer required
when using the scan token.
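
As an illustration of the intended flow, here is a minimal sketch
using the Java client (not code from this patch; the master address,
table name, and column names are placeholders): the driver builds and
serializes scan tokens, and each task rebuilds a scanner directly
from the token bytes.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduScanToken;
    import org.apache.kudu.client.KuduScanner;
    import org.apache.kudu.client.KuduTable;
    import org.apache.kudu.client.RowResult;
    import org.apache.kudu.client.RowResultIterator;

    public class ScanTokenRoundTrip {
      public static void main(String[] args) throws Exception {
        // Driver side: open the table once and build serializable scan tokens.
        KuduClient driverClient =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        KuduTable table = driverClient.openTable("metrics");
        List<KuduScanToken> tokens = driverClient.newScanTokenBuilder(table)
            .setProjectedColumnNames(Arrays.asList("host", "metric", "value"))
            .build();
        byte[] serialized = tokens.get(0).serialize();

        // Task side: rebuild a scanner from the token bytes. With the table
        // and tablet metadata embedded in the token, this step no longer
        // requires GetTableSchema/GetTableLocations calls to the master.
        KuduClient taskClient =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        KuduScanner scanner =
            KuduScanToken.deserializeIntoScanner(serialized, taskClient);
        while (scanner.hasMoreRows()) {
          RowResultIterator rows = scanner.nextRows();
          for (RowResult row : rows) {
            System.out.println(row.rowToString());
          }
        }
        taskClient.close();
        driverClient.close();
      }
    }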

New TableMetadataPB, TabletMetadataPB, and authorization token
fields were added as optional fields on the token. Additionally, a
`projected_column_idx` field was added that can be used in place
of `projected_columns`. This significantly reduces the size of
the scan token by not duplicating the ColumnSchemaPB that is
already in the TableMetadataPB.

Adding the table metadata to the scan token is enabled by
default, since it is more scalable and performant. However,
it can be disabled in the rare cases where more resilience to
column renaming is desired. One example where the table
metadata is disabled is the backup job. Future work, tracked
by KUDU-3146, should allow the table metadata to be leveraged
in those cases as well.
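
A minimal sketch of that toggle from the Java client, assuming the
scan token builder exposes includeTableMetadata() and
includeTabletMetadata() switches for this feature (verify against
the KuduScanTokenBuilder javadoc):

    import java.util.List;

    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduScanToken;
    import org.apache.kudu.client.KuduTable;

    public class TokensWithoutTableMetadata {
      // Assumed builder toggles: with the table metadata left out, tasks
      // fetch the schema themselves, which tolerates columns being renamed
      // between token creation and use.
      static List<KuduScanToken> buildTokens(KuduClient client, KuduTable table)
          throws Exception {
        return client.newScanTokenBuilder(table)
            .includeTableMetadata(false)  // do not embed the table schema
            .includeTabletMetadata(true)  // still embed tablet locations
            .build();
      }
    }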

This doesn’t avoid the need for a call to the master to get the
schema when writing data to Kudu; that work is tracked by
KUDU-3135. I expect the TableMetadataPB message would be
used there as well.

I included the ability to disable this functionality in the
kudu-spark integration via `kudu.useDriverMetadata` just
in case there are any unforeseen issues or regressions with
this feature.
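
For example, a kudu-spark read that opts out of the embedded
metadata might look like the following sketch (the master address
and table name are placeholders; the option name is the one
mentioned above):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KuduSparkReadWithoutDriverMetadata {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("kudu-scan-without-driver-metadata")
            .getOrCreate();

        // kudu.useDriverMetadata=false falls back to per-task metadata lookups.
        Dataset<Row> df = spark.read()
            .format("kudu")
            .option("kudu.master", "kudu-master:7051")
            .option("kudu.table", "metrics")
            .option("kudu.useDriverMetadata", "false")
            .load();
        df.show();
      }
    }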

I added a test to compare the serialized size of the scan token with
and without the table and tablet metadata. The size results for a
100-column table are:
   no metadata: 2697 bytes
   tablet metadata: 2805 bytes
   tablet, table, and authz metadata: 3258 bytes

Change-Id: I88c1b8392de37dd5e8b7bd8b78a21603ff8b1d1b
Reviewed-on: http://gerrit.cloudera.org:8080/16031
Reviewed-by: Grant Henke <gr...@apache.org>
Tested-by: Grant Henke <gr...@apache.org>


> Add Client Metadata Tokens
> --------------------------
>
>                 Key: KUDU-3135
>                 URL: https://issues.apache.org/jira/browse/KUDU-3135
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.12.0
>            Reporter: Grant Henke
>            Assignee: Grant Henke
>            Priority: Major
>              Labels: roadmap-candidate, scalability
>
> Currently, when a distributed job is run using the Kudu client, the driver/coordinator client needs to open the table to request its current metadata and locations. Then it can distribute the work to tasks/executors on remote nodes. In the case of reading data, ScanTokens are often used to distribute the work, and in the case of writing data, perhaps just the table name is required.
> The problem is that each parallel task then also needs to open the table to request its metadata (a pattern sketched after this description). Using Spark as an example, this happens when deserializing the scan tokens in KuduRDD ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala#L107-L108]) or when writing rows using the KuduContext ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L466]). This results in a large burst of metadata requests to the leader Kudu master all at once. Given that the Kudu master is a single server and requests can't be served from the follower masters, this effectively limits the number of parallel tasks that can run in a large Kudu deployment. Even if the follower masters could service the requests, that would still limit scalability in very large clusters, given that most deployments only have 3-5 masters.
> Adding a metadata token, similar to a scan token, would be a useful way to allow the single driver to fetch all of the metadata required for the parallel tasks. The tokens can be serialized and then passed to each task in a similar fashion to scan tokens.
> Of course, in a pessimistic case, something may change between generation of the token and the start of the task. In that case, a request would need to be sent to get the updated metadata. However, that scenario should be rare and likely would not result in all of the requests happening at the same time.
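
A minimal sketch of the per-task pattern described above (the helper
and names are hypothetical; the point is that every parallel task
opening the table issues its own metadata RPC to the leader master):

    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduTable;

    public class PerTaskTableOpen {
      // Each task builds its own client and opens the table, so N tasks
      // produce N near-simultaneous GetTableSchema requests to the master.
      static KuduTable openOnTask(String masters, String tableName) throws Exception {
        KuduClient taskClient = new KuduClient.KuduClientBuilder(masters).build();
        return taskClient.openTable(tableName);  // metadata request to the leader master
      }
    }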



--
This message was sent by Atlassian Jira
(v8.3.4#803005)