Posted to commits@beam.apache.org by "Tim Robertson (JIRA)" <ji...@apache.org> on 2018/07/23 20:43:00 UTC
[jira] [Updated] (BEAM-2661) Add KuduIO

     [ https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Robertson updated BEAM-2661:
--------------------------------
    Description: 
New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).

This work is in progress [on this branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with design aspects documented below.
h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects into operations. This is similar to {{JdbcIO}} but differs from others such as {{HBaseIO}}, which requires a pre-transform stage to convert elements into the mutations to apply. The original intent was to copy the {{HBaseIO}} approach, but this was not possible:
 # The Kudu [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html] is a fat class, and is a subclass of {{KuduRpc<OperationResponse>}}. It holds RPC logic, callbacks and a Kudu client. Because of this, the {{Operation}} does not serialize. Furthermore, the logic for encoding the operations (Insert, Upsert, etc.) in the Kudu Java API is one way only (no decode) because the server is written in C++.
 # An alternative could be to introduce a new object to Beam (e.g. {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. This was considered but discounted because:
 ## It is not a familiar API to those who already know Kudu
 ## It still requires serialization and deserialization of the operations. Using the existing Kudu approach of serializing into compact byte arrays would require a decoder along the lines of [this almost complete example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]. This is possible but fragile, given that the Kudu code itself continues to evolve.
 ## Deferring the object-to-mutation mapping to within the {{KuduIO}} transform leaves only a trivial codebase to maintain in Beam. {{JdbcIO}} gives us the precedent to do this.
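
The design choice above can be illustrated with a minimal sketch. It is not the {{KuduIO}} code on the branch: every type here ({{FakeOperation}}, {{User}}, {{FormatFn}}, {{UserToUpsert}}, {{SketchWriter}}) is a hypothetical stand-in, with no dependency on Beam or Kudu. It only shows the idea that the write step holds a serializable conversion function and builds the non-serializable operations at write time, so the operations themselves never need to be encoded or decoded.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Kudu Operation. Deliberately NOT Serializable.
final class FakeOperation {
  final String kind;
  final String key;
  final String value;

  FakeOperation(String kind, String key, String value) {
    this.kind = kind;
    this.key = key;
    this.value = value;
  }
}

// Hypothetical element type that would travel in the PCollection.
final class User implements Serializable {
  private static final long serialVersionUID = 1L;
  final String id;
  final String name;

  User(String id, String name) {
    this.id = id;
    this.name = name;
  }
}

// Hypothetical shape of the user-provided function: object in, operation out.
interface FormatFn<T> extends Serializable {
  FakeOperation apply(T element);
}

// Example user function mapping a User to an upsert.
final class UserToUpsert implements FormatFn<User> {
  private static final long serialVersionUID = 1L;

  @Override
  public FakeOperation apply(User user) {
    return new FakeOperation("UPSERT", user.id, user.name);
  }
}

// Stand-in for the write step: it carries only the function, never operations.
final class SketchWriter<T> implements Serializable {
  private static final long serialVersionUID = 1L;
  private final FormatFn<T> formatFn;

  SketchWriter(FormatFn<T> formatFn) {
    this.formatFn = formatFn;
  }

  // Operations are created here, at write time, and are never serialized.
  List<FakeOperation> write(List<T> bundle) {
    List<FakeOperation> applied = new ArrayList<>();
    for (T element : bundle) {
      applied.add(formatFn.apply(element));
    }
    return applied;
  }

  // Java-serialization round trip, mimicking shipping the writer to a worker.
  @SuppressWarnings("unchecked")
  static <T> SketchWriter<T> roundTrip(SketchWriter<T> writer) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(writer);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SketchWriter<T>) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("writer did not survive serialization", e);
    }
  }
}

final class FormatFnSketch {
  public static void main(String[] args) {
    SketchWriter<User> writer =
        SketchWriter.roundTrip(new SketchWriter<User>(new UserToUpsert()));
    List<User> bundle = Arrays.asList(new User("1", "ada"), new User("2", "grace"));
    for (FakeOperation op : writer.write(bundle)) {
      System.out.println(op.kind + " " + op.key + " " + op.value);
    }
  }
}
```

In this sketch the writer serializes because it holds only the function, while {{FakeOperation}} never has to. That is the property the {{Operation}} class lacks and the reason the mapping is deferred to the write step.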

h2. Testing framework

{{Kudu}} is written in C++. While a [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java] does exist in Java, it requires binaries to be available for the target environment, which is not portable (edit: this is now a [work in progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore we opt for the following:
 # Unit tests will use a mock Kudu client
 # Integration tests will cover all aspects of {{KuduIO}} and use a Docker-based Kudu instance
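
As a rough illustration of the unit-test half of this plan, the sketch below replaces the client with a hand-written in-memory fake. It makes no claim about how the branch does it: {{RowSink}}, {{RecordingSink}} and {{BundleWriter}} are hypothetical names, and a mocking library could equally stand in for the hand-written fake.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical seam between the write logic and Kudu; a real client would sit behind it.
interface RowSink extends AutoCloseable {
  void write(String row);

  @Override
  void close();
}

// In-memory fake used in place of a live cluster. It records what was written.
final class RecordingSink implements RowSink {
  final List<String> written = new ArrayList<>();
  boolean closed;

  @Override
  public void write(String row) {
    if (closed) {
      throw new IllegalStateException("write after close");
    }
    written.add(row);
  }

  @Override
  public void close() {
    closed = true;
  }
}

// Code under test: writes every row of a bundle and always closes the sink.
final class BundleWriter {
  static int writeBundle(List<String> rows, RowSink sink) {
    try (RowSink s = sink) {
      for (String row : rows) {
        s.write(row);
      }
    }
    return rows.size();
  }
}
```

A unit test can then assert on {{RecordingSink.written}} and {{RecordingSink.closed}} without any Kudu binaries, leaving behaviour against a real server to the Docker-based integration tests.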

  was:
New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).

This work is in progress [on this branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with design aspects documented below.
h2. The API

The {{KuduIO}} API requires the user to provide a function to convert objects into operations. This is similar to {{JdbcIO}} but differs from others such as {{HBaseIO}}, which requires a pre-transform stage to convert elements into the mutations to apply. The original intent was to copy the {{HBaseIO}} approach, but this was not possible:
 # The Kudu [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html] is a fat class, and is a subclass of {{KuduRpc<OperationResponse>}}. It holds RPC logic, callbacks and a Kudu client. Because of this, the {{Operation}} does not serialize. Furthermore, the logic for encoding the operations (Insert, Upsert, etc.) in the Kudu Java API is one way only (no decode) because the server is written in C++.
 # An alternative could be to introduce a new object to Beam (e.g. {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. This was considered but discounted because:
 ## It is not a familiar API to those who already know Kudu
 ## It still requires serialization and deserialization of the operations. Using the existing Kudu approach of serializing into compact byte arrays would require a decoder along the lines of [this almost complete example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]. This is possible but fragile, given that the Kudu code itself continues to evolve.
 ## Deferring the object-to-mutation mapping to within the {{KuduIO}} transform leaves only a trivial codebase to maintain in Beam. {{JdbcIO}} gives us the precedent to do this.

h2. Testing framework

{{Kudu}} is written in C++. While a [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java] does exist in Java, it requires binaries to be available for the target environment, which is not portable. Therefore we opt for the following:
 # Unit tests will use a mock Kudu client
 # Integration tests will cover all aspects of {{KuduIO}} and use a Docker-based Kudu instance


> Add KuduIO
> ----------
>
>                 Key: BEAM-2661
>                 URL: https://issues.apache.org/jira/browse/BEAM-2661
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Tim Robertson
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)