Posted to jira@kafka.apache.org by "Randall Hauch (Jira)" <ji...@apache.org> on 2021/02/08 17:37:01 UTC
[jira] [Comment Edited] (KAFKA-12305) Flatten SMT fails on arrays

    [ https://issues.apache.org/jira/browse/KAFKA-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281246#comment-17281246 ] 

Randall Hauch edited comment on KAFKA-12305 at 2/8/21, 5:36 PM:
----------------------------------------------------------------

[~ChrisEgerton] wrote:
{quote}
A naive approach that preserves arrays as-are and doesn't attempt to flatten them seems fair for now, but one alternative could be to traverse array elements and, if any are maps or structs, flatten those as well.
{quote}
+1 for this behavior. Here's my thought process:

The `Flatten` transform documentation ([source|https://github.com/apache/kafka/blob/8bd5ceb3d297bd6cd06ccec52978315898719e6d/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/Flatten.java#L43-L45]) says:
{quote}
Flatten a nested data structure, generating names for each field by concatenating the field names at each level with a configurable delimiter character. Applies to a Struct when a schema is present, or a Map in the case of schemaless data.
{quote}
IMO, this explains the intention: flatten nested `Struct` instances when a schema is present (e.g., flatten the Struct fields of a record's key/value Struct), or nested `Map` instances when the key/value is schemaless.

Nowhere does it mention that arrays are flattened into a separate field for each element in the array.

For example, consider this record key or value (with a top-level schema that has 4 fields, one of which is a struct with 3 fields):
{code}
{
  "f1": "field 1 value",           // field with string schema
  "f2": {                          // field with struct schema containing 3 fields
    "nestedInt": 0,                // field with int32 schema
    "nestedString": "nested",      // field with string schema
    "nestedArray": [ "v1", "v2" ]  // field with array(string) schema
  },
  "f3": [ "e1", "e2", "e3" ],      // field with array(string) schema
  "f4": <null>                     // field with optional float32 schema
}
{code}

Using what you mention as the "naive approach", Flatten applied to the record key/value described above should produce the following key or value (with a schema that has 6 fields):
{code}
{
  "f1": "field 1 value",            // field with string schema
  "f2.nestedInt": 0,                // field with int32 schema
  "f2.nestedString": "nested",      // field with string schema
  "f2.nestedArray": [ "v1", "v2" ], // field with array(string) schema
  "f3": [ "e1", "e2", "e3" ],       // field with array(string) schema
  "f4": <null>                      // field with optional float32 schema
}
{code}

Note that the arrays are *_not_* expanded into separate fields for each element. This makes sense to me and aligns with the previously mentioned documentation for the Flatten SMT.
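
To make the expected behavior concrete, here is a minimal sketch of the "naive approach" for the schemaless (Map) case only. It is not the code in `Flatten.java`, and it ignores schemas entirely; the class name `NaiveFlattenSketch` and the methods `flatten` and `sampleRecord` are hypothetical names chosen for illustration. Nested maps are expanded using the delimiter, while lists (arrays) and all other values, including nulls, are copied through unchanged.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Kafka Connect.
final class NaiveFlattenSketch {

    /** Flattens nested maps; every other value, including lists and nulls, is kept as-is. */
    static Map<String, Object> flatten(Map<String, Object> value, String delimiter) {
        Map<String, Object> flattened = new LinkedHashMap<>();
        flattenInto(flattened, "", value, delimiter);
        return flattened;
    }

    private static void flattenInto(Map<String, Object> target, String prefix,
                                    Map<?, ?> source, String delimiter) {
        for (Map.Entry<?, ?> entry : source.entrySet()) {
            String name = prefix.isEmpty()
                    ? String.valueOf(entry.getKey())
                    : prefix + delimiter + entry.getKey();
            Object fieldValue = entry.getValue();
            if (fieldValue instanceof Map) {
                // Nested map: recurse and extend the field-name prefix.
                flattenInto(target, name, (Map<?, ?>) fieldValue, delimiter);
            } else {
                // Arrays (lists), primitives, and nulls are not expanded.
                target.put(name, fieldValue);
            }
        }
    }

    /** Builds a schemaless version of the example record from this comment. */
    static Map<String, Object> sampleRecord() {
        Map<String, Object> f2 = new LinkedHashMap<>();
        f2.put("nestedInt", 0);
        f2.put("nestedString", "nested");
        f2.put("nestedArray", Arrays.asList("v1", "v2"));

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("f1", "field 1 value");
        record.put("f2", f2);
        record.put("f3", Arrays.asList("e1", "e2", "e3"));
        record.put("f4", null);
        return record;
    }

    public static void main(String[] args) {
        // Prints the flattened field names in insertion order.
        flatten(sampleRecord(), ".").keySet().forEach(System.out::println);
    }
}
```

With the sample record, the flattened map has six entries, and `f2.nestedArray` and `f3` still hold their original lists. A schema-aware version would additionally have to build the flattened schema, which this sketch does not attempt.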



> Flatten SMT fails on arrays
> ---------------------------
>
>                 Key: KAFKA-12305
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12305
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.0.1, 2.1.1, 2.2.2, 2.3.1, 2.4.1, 2.5.1, 2.7.0, 2.6.1, 2.8.0
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>
> The {{Flatten}} SMT fails for array types. A sophisticated approach that tries to flatten arrays might be desirable in some cases, and may have been punted during the early design phase of the transform, but in the interim, it's probably not worth it to make array data and the SMT mutually exclusive.
> A naive approach that preserves arrays as-are and doesn't attempt to flatten them seems fair for now, but one alternative could be to traverse array elements and, if any are maps or structs, flatten those as well.
> Adding behavior to fully flatten arrays by essentially transforming them into maps whose elements are the elements of the array and whose keys are the indices of each element is likely out of scope for a bug fix and, although useful, might have to wait for a KIP.


