You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "kirkrodrigues (via GitHub)" <gi...@apache.org> on 2023/07/28 19:05:19 UTC

[GitHub] [pinot] kirkrodrigues opened a new pull request, #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

kirkrodrigues opened a new pull request, #11210:
URL: https://github.com/apache/pinot/pull/11210

   tags: feature, release-notes
   
   This adds a [RecordTransformer](https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/RecordTransformer.java) to transform semi-structured (e.g., JSON) log events to fit a table's schema without dropping fields.
   
   JSON log events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus, this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall field.
   
   For example, consider this log event:
   ```
    {
      "timestamp": 1687786535928,
      "hostname": "host1",
      "level": "INFO",
      "message": "Started processing job1",
      "tags": {
        "platform": "data",
        "service": "serializer",
        "params": {
          "queueLength": 5,
          "timeout": 299,
          "userData_noIndex": {
            "nth": 99
          }
        }
      }
    }
   ```
    And let's say the table's schema contains these fields:
   * timestamp
   * hostname
   * level
   * message
   * tags.platform
   * tags.service
   * indexableExtras
   * unindexableExtras
   
    Without this transformer, the entire `tags` field would be dropped when storing the record in the table. However,
    with this transformer, the record would be transformed into the following:
   ```
    {
      "timestamp": 1687786535928,
      "hostname": "host1",
      "level": "INFO",
      "message": "Started processing job1",
      "tags.platform": "data",
      "tags.service": "serializer",
      "indexableExtras": {
        "tags": {
          "params": {
            "queueLength": 5,
            "timeout": 299
          }
        }
      },
      "unindexableExtras": {
        "tags": {
          "userData_noIndex": {
            "nth": 99
          }
        }
      }
    }
   ```
   
   Notice that the transformer:
   * Flattens nested fields which exist in the schema, like `tags.platform`
   * Moves fields which don't exist in the schema into the `indexableExtras` field
   * Moves fields which don't exist in the schema and have the suffix "_noIndex" into the `unindexableExtras` field
   
   The `unindexableExtras` field allows the transformer to separate fields which don't need indexing (because they are
    only retrieved, not searched) from those that do. The transformer also has other configuration options specified in `JsonLogTransformerConfig`.
   
   This is part of the change requested in #9819 and described in this [design doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit#heading=h.itv87iq05rqh).
   
   # Testing performed
   * Added new unit tests.
   * Validated JSON log events with dynamic schemas could be ingested into a table without dropping fields (unless configured to).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287770149


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider

Review Comment:
   I think in the javadoc, we should in general explain only the behavior of the transformer. How the transformer is implemented can be removed from this javadoc section and moved to the codes. In the following example, I think we should emphasize that a schema path can be more than 2 levels. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287023681


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>

Review Comment:
   Do you mean entries in an array, where multiple entries have the same key? We don't flatten the contents of arrays.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287769204


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in

Review Comment:
   Can we mention the `fieldPathsToDrop` field and show how it is used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287023147


##########
pinot-spi/src/main/java/org/apache/pinot/spi/stream/StreamDataDecoderImpl.java:
##########
@@ -34,6 +34,13 @@ public class StreamDataDecoderImpl implements StreamDataDecoder {
   private final StreamMessageDecoder _valueDecoder;
   private final GenericRow _reuse = new GenericRow();
 
+  /**
+   * @return Whether the given key is one of the special types of keys (__KEY, header, metadata, etc.)
+   */
+  public static boolean isSpecialKeyType(String key) {

Review Comment:
   It's used so that we can *skip* transforming these fields in the `JsonLogTransformer`. I put it in this class so that if a developer changes the list of special fields in this class, they know to update this method as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1290341303


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider

Review Comment:
   Good point. Moved this section to `processFields`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1290342559


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"

Review Comment:
   Oh it's already configurable through [JsonLogTransformerConfig](https://github.com/apache/pinot/pull/11210/files#diff-8a155cf3e668ba08fe3d5063d0ff76e54b83abace4f5fefbab93ce3bf2edf93cR38). I added some text to the Javadoc indicating it's configurable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287755010


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>

Review Comment:
   What I meant is fields like "platform" occur multiple times under "tags".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1290341638


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in

Review Comment:
   Good catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287024788


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
+ * this record
+ * <pre>
+ * {
+ *   a: {
+ *     b: {
+ *       c: 0,
+ *       d: 1
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Assume "$.a.b.c" exists in the schema but "$.a.b.d" doesn't. This class processes the record recursively from the
+ * root node to the children, so it would only know that "$.a.b.d" doesn't exist when it gets to "d". At this point we
+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {

Review Comment:
   Good point. Instead of calling it `JsonLogTransformer`, we could pick a name that reflects that it takes an input record and transforms it to fit the schema losslessly. Here are some options I can think of:
   
   1. `LosslessSchemaTransformer`
   1. `SchemaAdaptationTransformer`
   1. `SchemaFitTransformer`
   1. `SchemaMoldingTransformer`
   1. `TableAdaptationTransformer`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1284838648


##########
pinot-spi/src/main/java/org/apache/pinot/spi/stream/StreamDataDecoderImpl.java:
##########
@@ -34,6 +34,13 @@ public class StreamDataDecoderImpl implements StreamDataDecoder {
   private final StreamMessageDecoder _valueDecoder;
   private final GenericRow _reuse = new GenericRow();
 
+  /**
+   * @return Whether the given key is one of the special types of keys (__KEY, header, metadata, etc.)
+   */
+  public static boolean isSpecialKeyType(String key) {

Review Comment:
   Why need this static method? It is used only once and can be put outside of the class?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>

Review Comment:
   (1) what happens if the nest field occurs multiple times in the json? will we put all the values in the column?
   



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>

Review Comment:
   Is this index referring to Pinot index? Is there any restriction or assumption on the index that should be applied?



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
+ * this record
+ * <pre>
+ * {
+ *   a: {
+ *     b: {
+ *       c: 0,
+ *       d: 1
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Assume "$.a.b.c" exists in the schema but "$.a.b.d" doesn't. This class processes the record recursively from the
+ * root node to the children, so it would only know that "$.a.b.d" doesn't exist when it gets to "d". At this point we
+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {

Review Comment:
   I wonder how much of the json transformation here is specific to log related application only? Can we refactor the config/transformer so that it can be used for all Json transformation in general? A lot of what this transformer does can be used for non-logging purpose. If possible, I prefer we made a general json transformer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287762730


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
+ * this record
+ * <pre>
+ * {
+ *   a: {
+ *     b: {
+ *       c: 0,
+ *       d: 1
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Assume "$.a.b.c" exists in the schema but "$.a.b.d" doesn't. This class processes the record recursively from the
+ * root node to the children, so it would only know that "$.a.b.d" doesn't exist when it gets to "d". At this point we
+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {
+  private static final Logger _logger = LoggerFactory.getLogger(JsonLogTransformer.class);
+
+  private final boolean _continueOnError;
+  private final JsonLogTransformerConfig _transformerConfig;
+  private final DataType _indexableExtrasFieldType;
+  private final DataType _unindexableExtrasFieldType;
+
+  private Map<String, Object> _schemaTree;
+
+  /**
+   * Validates the schema against the given transformer's configuration.
+   */
+  public static void validateSchema(@Nonnull Schema schema, @Nonnull JsonLogTransformerConfig transformerConfig) {
+    validateSchemaFieldNames(schema.getPhysicalColumnNames(), transformerConfig);
+
+    String indexableExtrasFieldName = transformerConfig.getIndexableExtrasField();
+    getAndValidateExtrasFieldType(schema, indexableExtrasFieldName);
+    String unindexableExtrasFieldName = transformerConfig.getUnindexableExtrasField();
+    if (null != unindexableExtrasFieldName) {
+      getAndValidateExtrasFieldType(schema, indexableExtrasFieldName);
+    }
+
+    validateSchemaAndCreateTree(schema);
+  }
+
+  /**
+   * Validates that none of the schema fields have names that conflict with the transformer's configuration.
+   */
+  private static void validateSchemaFieldNames(Set<String> schemaFields, JsonLogTransformerConfig transformerConfig) {
+    // Validate that none of the columns in the schema end with unindexableFieldSuffix
+    String unindexableFieldSuffix = transformerConfig.getUnindexableFieldSuffix();
+    if (null != unindexableFieldSuffix) {
+      for (String field : schemaFields) {
+        Preconditions.checkState(!field.endsWith(unindexableFieldSuffix), "Field '%s' has no-index suffix '%s'", field,
+            unindexableFieldSuffix);
+      }
+    }
+
+    // Validate that none of the columns in the schema end overlap with the fields in fieldPathsToDrop
+    Set<String> fieldPathsToDrop = transformerConfig.getFieldPathsToDrop();
+    if (null != fieldPathsToDrop) {
+      Set<String> fieldIntersection = new HashSet<>(schemaFields);
+      fieldIntersection.retainAll(fieldPathsToDrop);
+      Preconditions.checkState(fieldIntersection.isEmpty(), "Fields in schema overlap with fieldPathsToDrop");
+    }
+  }
+
+  /**
+   * @return The field type for the given extras field
+   */
+  private static DataType getAndValidateExtrasFieldType(Schema schema, @Nonnull String extrasFieldName) {
+    FieldSpec fieldSpec = schema.getFieldSpecFor(extrasFieldName);
+    Preconditions.checkState(null != fieldSpec, "Field '%s' doesn't exist in schema", extrasFieldName);
+    DataType fieldDataType = fieldSpec.getDataType();
+    Preconditions.checkState(DataType.JSON == fieldDataType || DataType.STRING == fieldDataType,
+        "Field '%s' has unsupported type %s", fieldDataType.toString());
+    return fieldDataType;
+  }
+
+  /**
+   * Validates the schema with a JsonLogTransformerConfig instance and creates a tree representing the fields in the
+   * schema to be used when transforming input records. For instance, the field "a.b" in the schema would be
+   * un-flattened into "{a: b: null}" in the tree, allowing us to more easily process records containing the latter.
+   * @throws IllegalArgumentException if schema validation fails

Review Comment:
   Can you add javadoc (and some examples) on when the schema validation will fail? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287024009


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>

Review Comment:
   Yes, it's referring to a Pinot index; the suggestion is for the user to index `indexableExtras` and to *not* index `unindexableExtras`. However, the transformer doesn't restrict the user from applying any index they choose to either column, so long as it can apply to a JSON/string type column.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kirkrodrigues commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "kirkrodrigues (via GitHub)" <gi...@apache.org>.
kirkrodrigues commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1290343214


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
+ * this record
+ * <pre>
+ * {
+ *   a: {
+ *     b: {
+ *       c: 0,
+ *       d: 1
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Assume "$.a.b.c" exists in the schema but "$.a.b.d" doesn't. This class processes the record recursively from the
+ * root node to the children, so it would only know that "$.a.b.d" doesn't exist when it gets to "d". At this point we
+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {

Review Comment:
   We could add "Json", but I think this is more general than JSON since it could handle records from any of the decoders like CSV, Parquet, etc. I'm fine with adding "Json" if you think that's best.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>

Review Comment:
   I might be misunderstanding, but since the input to the transformer is a `GenericRow` (which uses a map to store columns), I don't think we can have duplicate keys in the row.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287758144


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"

Review Comment:
   Can we make the suffix configurable in the table config?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1300661344


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
+ * this record
+ * <pre>
+ * {
+ *   a: {
+ *     b: {
+ *       c: 0,
+ *       d: 1
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Assume "$.a.b.c" exists in the schema but "$.a.b.d" doesn't. This class processes the record recursively from the
+ * root node to the children, so it would only know that "$.a.b.d" doesn't exist when it gets to "d". At this point we
+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {

Review Comment:
   SchemaConformingTransformer then?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #11210:
URL: https://github.com/apache/pinot/pull/11210#issuecomment-1656236261

   ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   > Merging [#11210](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (9ad0b87) into [master](https://app.codecov.io/gh/apache/pinot/commit/11b85df3318ab414a7dff30e2b1d979259b984fc?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (11b85df) will **increase** coverage by `0.00%`.
   > Report is 11 commits behind head on master.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@            Coverage Diff            @@
   ##           master   #11210     +/-   ##
   =========================================
     Coverage    0.11%    0.11%             
   =========================================
     Files        2223     2173     -50     
     Lines      119304   117024   -2280     
     Branches    18059    17794    -265     
   =========================================
     Hits          137      137             
   + Misses     119147   116867   -2280     
     Partials       20       20             
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1temurin17 | `?` | |
   | unittests1temurin11 | `?` | |
   | unittests1temurin17 | `?` | |
   | unittests1temurin20 | `?` | |
   | unittests2temurin11 | `0.11% <0.00%> (-0.01%)` | :arrow_down: |
   | unittests2temurin17 | `0.11% <0.00%> (-0.01%)` | :arrow_down: |
   | unittests2temurin20 | `0.11% <0.00%> (-0.01%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Files Changed](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [.../local/recordtransformer/CompositeTransformer.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWNvcmR0cmFuc2Zvcm1lci9Db21wb3NpdGVUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...nt/local/recordtransformer/JsonLogTransformer.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWNvcmR0cmFuc2Zvcm1lci9Kc29uTG9nVHJhbnNmb3JtZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ache/pinot/segment/local/utils/IngestionUtils.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9Jbmdlc3Rpb25VdGlscy5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...he/pinot/segment/local/utils/TableConfigUtils.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC91dGlscy9UYWJsZUNvbmZpZ1V0aWxzLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ot/spi/config/table/ingestion/IngestionConfig.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvY29uZmlnL3RhYmxlL2luZ2VzdGlvbi9Jbmdlc3Rpb25Db25maWcuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...nfig/table/ingestion/JsonLogTransformerConfig.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvY29uZmlnL3RhYmxlL2luZ2VzdGlvbi9Kc29uTG9nVHJhbnNmb3JtZXJDb25maWcuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...apache/pinot/spi/stream/StreamDataDecoderImpl.java](https://app.codecov.io/gh/apache/pinot/pull/11210?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3Qtc3BpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9zcGkvc3RyZWFtL1N0cmVhbURhdGFEZWNvZGVySW1wbC5qYXZh) | `0.00% <0.00%> (ø)` | |
   
   ... and [89 files with indirect coverage changes](https://app.codecov.io/gh/apache/pinot/pull/11210/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287759588


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider
+ * this record
+ * <pre>
+ * {
+ *   a: {
+ *     b: {
+ *       c: 0,
+ *       d: 1
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Assume "$.a.b.c" exists in the schema but "$.a.b.d" doesn't. This class processes the record recursively from the
+ * root node to the children, so it would only know that "$.a.b.d" doesn't exist when it gets to "d". At this point we
+ * need to add "d" and all of its parents to the indexableExtrasField. To do so efficiently, the class builds this
+ * branch starting from the leaf and attaches it to parent nodes as we return from each recursive call.
+ */
+public class JsonLogTransformer implements RecordTransformer {

Review Comment:
   No 2 looks like a good one. Should we also add json to the class name and make it JsonSchemaAdaptionTransformer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat merged pull request #11210: Add SchemaConformingTransformer to transform records with varying keys to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat merged PR #11210:
URL: https://github.com/apache/pinot/pull/11210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] chenboat commented on a diff in pull request #11210: Add JsonLogTransformer to transform semi-structured log events to fit a table's schema without dropping fields.

Posted by "chenboat (via GitHub)" <gi...@apache.org>.
chenboat commented on code in PR #11210:
URL: https://github.com/apache/pinot/pull/11210#discussion_r1287770149


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/JsonLogTransformer.java:
##########
@@ -0,0 +1,534 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.recordtransformer;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.google.common.base.Preconditions;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import javax.annotation.Nonnull;
+import javax.annotation.Nullable;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.config.table.ingestion.JsonLogTransformerConfig;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.data.Schema;
+import org.apache.pinot.spi.data.readers.GenericRow;
+import org.apache.pinot.spi.stream.StreamDataDecoderImpl;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * This transformer transforms a record representing a JSON log event such that it can be stored in a table. JSON log
+ * events typically have a user-defined schema, so it is impractical to store each field in its own table column. At the
+ * same time, most (if not all) fields are important to the user, so we should not drop any field unnecessarily. Thus,
+ * this transformer primarily takes record-fields that don't exist in the schema and stores them in a type of catchall
+ * field.
+ * <p>
+ * For example, consider this log event:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags": {
+ *     "platform": "data",
+ *     "service": "serializer",
+ *     "params": {
+ *       "queueLength": 5,
+ *       "timeout": 299,
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * And let's say the table's schema contains these fields:
+ * <ul>
+ *   <li>timestamp</li>
+ *   <li>hostname</li>
+ *   <li>level</li>
+ *   <li>message</li>
+ *   <li>tags.platform</li>
+ *   <li>tags.service</li>
+ *   <li>indexableExtras</li>
+ *   <li>unindexableExtras</li>
+ * </ul>
+ * <p>
+ * Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However,
+ * with this transformer, the record would be transformed into the following:
+ * <pre>
+ * {
+ *   "timestamp": 1687786535928,
+ *   "hostname": "host1",
+ *   "level": "INFO",
+ *   "message": "Started processing job1",
+ *   "tags.platform": "data",
+ *   "tags.service": "serializer",
+ *   "indexableExtras": {
+ *     "tags": {
+ *       "params": {
+ *         "queueLength": 5,
+ *         "timeout": 299
+ *       }
+ *     }
+ *   },
+ *   "unindexableExtras": {
+ *     "tags": {
+ *       "userData_noIndex": {
+ *         "nth": 99
+ *       }
+ *     }
+ *   }
+ * }
+ * </pre>
+ * Notice that the transformer:
+ * <ul>
+ *   <li>Flattens nested fields which exist in the schema, like "tags.platform"</li>
+ *   <li>Moves fields which don't exist in the schema into the "indexableExtras" field</li>
+ *   <li>Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras"
+ *   field</li>
+ * </ul>
+ * <p>
+ * The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are
+ * only retrieved, not searched) from those that do. The transformer also has other configuration options specified in
+ * {@link JsonLogTransformerConfig}.
+ * <p>
+ * One notable complication that this class handles is adding nested fields to the "extras" fields. E.g., consider

Review Comment:
   I think in the javadoc, we should in general explain only the behavior of the transformer. How the transformer is implemented can be removed from this javadoc section to keep the story simple.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org