Posted to dev@carbondata.apache.org by GitBox <gi...@apache.org> on 2021/09/29 15:53:19 UTC

[GitHub] [carbondata] pratyakshsharma opened a new pull request #4227: schema evolution test cases w/o data type change working

pratyakshsharma opened a new pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227


    ### Why is this PR needed?
    
    
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-930413780


   Build failed with Spark 3.1. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/364/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933332249


   Build failed with Spark 2.4.5. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4244/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933328785


   Build succeeded with Spark 2.3.4. Please check CI: http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5990/
   





[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r728023802



##########
File path: common/src/main/java/org/apache/carbondata/common/exceptions/sql/CarbonSchemaException.java
##########
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.common.exceptions.sql;
+
+import org.apache.carbondata.common.annotations.InterfaceAudience;
+import org.apache.carbondata.common.annotations.InterfaceStability;
+
+@InterfaceAudience.User
+@InterfaceStability.Stable
+public class CarbonSchemaException extends Exception {
+
+  private static final long serialVersionUID = 1L;
+
+  private final String msg;
+
+  public CarbonSchemaException(String msg) {
+    super(msg);
+    this.msg = msg;
+  }
+
+  public String getMsg() {

Review comment:
       ```suggestion
     public String getMessage() {
   ```
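
   For context, a minimal sketch of the cleaned-up class under that suggestion, assuming the message already stored by java.lang.Exception is all that is needed (the msg field and getMsg() then become redundant):

   ```java
   package org.apache.carbondata.common.exceptions.sql;

   import org.apache.carbondata.common.annotations.InterfaceAudience;
   import org.apache.carbondata.common.annotations.InterfaceStability;

   @InterfaceAudience.User
   @InterfaceStability.Stable
   public class CarbonSchemaException extends Exception {

     private static final long serialVersionUID = 1L;

     public CarbonSchemaException(String message) {
       // Exception#getMessage() already exposes this; no extra field required
       super(message);
     }
   }
   ```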

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    if (operationType != null &&
+        !MergeOperationType.withName(operationType.toUpperCase).equals(MergeOperationType.INSERT) &&
+        filterDupes) {
+      throw new MalformedCarbonCommandException("property CARBON_STREAMER_INSERT_DEDUPLICATE" +
+                                                " should only be set with operation type INSERT")
+    }
+    val isSchemaEnforcementEnabled = properties
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (operationType != null) {
+      if (isSchemaEnforcementEnabled) {
+        // call the util function to verify if incoming schema matches with target schema
+        CarbonMergeDataSetUtil.verifySourceAndTargetSchemas(targetDsOri, srcDS)
+      } else {
+        CarbonMergeDataSetUtil.handleSchemaEvolution(
+          targetDsOri, srcDS, sparkSession)
+      }
+    }
+
     // Target dataset must be backed by carbondata table.
-    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val tgtTable = relations.head.carbonRelation.carbonTable
+    val targetCarbonTable: CarbonTable = CarbonEnv.getCarbonTable(Option(tgtTable.getDatabaseName),

Review comment:
       Why is this code required?

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       Looks like the validations added here apply only when operationType is not null. Please move these validations inside the if (operationType != null) check.
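
   A hedged sketch of that restructuring, assuming filterDupes and isSchemaEnforcementEnabled are computed from CarbonProperties exactly as in the diff above:

   ```scala
   if (operationType != null) {
     // CARBON_STREAMER_INSERT_DEDUPLICATE is only valid for INSERT
     if (filterDupes &&
         !MergeOperationType.withName(operationType.toUpperCase)
           .equals(MergeOperationType.INSERT)) {
       throw new MalformedCarbonCommandException(
         "property CARBON_STREAMER_INSERT_DEDUPLICATE should only be set with " +
         "operation type INSERT")
     }
     if (isSchemaEnforcementEnabled) {
       // verify if the incoming schema matches the target schema
       CarbonMergeDataSetUtil.verifySourceAndTargetSchemas(targetDsOri, srcDS)
     } else {
       CarbonMergeDataSetUtil.handleSchemaEvolution(targetDsOri, srcDS, sparkSession)
     }
   }
   ```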

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {

Review comment:
       Can create a new variable for sourceSchema.fields.map(_.name.toLowerCase) and reuse it at line 516.
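
   A minimal sketch of the hoisting, assuming sourceSchema and targetSchema are the schemas already in scope in verifySourceAndTargetSchemas:

   ```scala
   // compute once, reuse for every membership test in this method
   val sourceFieldNames = sourceSchema.fields.map(_.name.toLowerCase)

   // missing-field check
   val missingFields = targetSchema.fields
     .filterNot(tgtField => sourceFieldNames.contains(tgtField.name.toLowerCase))

   // case-sensitivity duplicate check (the reuse at line 516)
   val similarFields = sourceFieldNames
     .groupBy(identity)
     .collect { case (name, occurrences) if occurrences.length > 1 => name }
     .toList
   ```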

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")

Review comment:
       Can assign the exception message to a new variable and reuse it for both the log and the throw. Please handle this in all applicable places.
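
   For example, a sketch against the quoted lines:

   ```scala
   val errorMsg = s"source schema does not contain field: ${tgtField.name}"
   LOGGER.error(errorMsg)
   throw new CarbonSchemaException(errorMsg)
   ```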

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       Can move sourceSchema.fields.map(_.name.toLowerCase) outside the loop into a new variable and reuse it.
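
   A sketch of the same hoisting at this call site:

   ```scala
   // evaluated once instead of per target field
   val sourceFieldNames = sourceSchema.fields.map(_.name.toLowerCase)
   val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
     .filterNot(f => sourceFieldNames.contains(f) || partitionColumns.contains(f))
   ```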

##########
File path: pom.xml
##########
@@ -130,7 +130,7 @@
     <scala.version>2.11.8</scala.version>
     <hadoop.deps.scope>compile</hadoop.deps.scope>
     <spark.version>2.3.4</spark.version>
-    <spark.binary.version>2.3</spark.binary.version>
+    <spark.binary.version>2.4</spark.binary.version>

Review comment:
       Why this change? Please revert.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    if (isPrimitiveAndNotDate) {
+      Comparator.getComparator(orderingFieldDataType)
+    } else if (orderingFieldDataType == DataTypes.STRING) {
+      null
+    } else {
+      Comparator.getComparatorByDataTypeForMeasure(orderingFieldDataType)
+    }
+  }
+
+  def getRowKey(
+      row: Row,
+      index: Integer,
+      carbonKeyColumn: CarbonColumn,
+      isPrimitiveAndNotDate: Boolean,
+      keyColumnDataType: CarbonDataType
+  ): AnyRef = {
+    if (!row.isNullAt(index)) {
+      row.getAs(index).toString
+    } else {
+      val value: Long = 0
+      if (carbonKeyColumn.isDimension) {
+        if (isPrimitiveAndNotDate) {
+          CarbonCommonConstants.EMPTY_BYTE_ARRAY
+        } else {
+          CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
+        }
+      } else {
+        val nullValueForMeasure = if ((keyColumnDataType eq DataTypes.BOOLEAN) ||

Review comment:
       Can replace this with a case match.
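
   A hedged sketch of the shape; the original if/else chain is truncated in this hunk, so the branches below are illustrative assumptions rather than the actual mapping:

   ```scala
   // illustrative only: the real branches must mirror the existing if/else chain
   val nullValueForMeasure: Any = keyColumnDataType match {
     case DataTypes.BOOLEAN => value.toByte   // assumed branch
     case DataTypes.SHORT   => value.toShort  // assumed branch
     case DataTypes.INT     => value.toInt    // assumed branch
     case _                 => value          // assumed fallback
   }
   ```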

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&

Review comment:
       1. I think this method itself is not needed. Comparator.getComparator can be called directly; for the string type, ByteArraySerializableComparator can be used.
   2. Please refactor the caller method as well.
   3. For the DATE type, this might throw IllegalArgumentException. Please check and handle it.
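
   A sketch of that direction; comparatorFor is a hypothetical helper name, ByteArraySerializableComparator is the class named above (its exact package and constructor are assumptions here), and the other imports are as already present in the file:

   ```scala
   import org.apache.carbondata.core.util.comparator.Comparator

   // hedged sketch: delegate to Comparator directly and special-case STRING;
   // DATE is mapped explicitly since getComparator may reject it
   def comparatorFor(dataType: CarbonDataType): SerializableComparator = {
     if (dataType == DataTypes.STRING) {
       new ByteArraySerializableComparator()  // assumed accessible
     } else if (dataType == DataTypes.DATE) {
       // assumption: DATE is backed by a surrogate int internally
       Comparator.getComparatorByDataTypeForMeasure(DataTypes.INT)
     } else if (DataTypeUtil.isPrimitiveColumn(dataType)) {
       Comparator.getComparator(dataType)
     } else {
       Comparator.getComparatorByDataTypeForMeasure(dataType)
     }
   }
   ```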

##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -1215,6 +1215,17 @@ private CarbonCommonConstants() {
 
   public static final String CARBON_ENABLE_BAD_RECORD_HANDLING_FOR_INSERT_DEFAULT = "false";
 
+  /**
+   * This flag decides if table schema needs to change as per the incoming batch schema.

Review comment:
       Can move/group the CDC-related properties in the same place.
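
   A hypothetical layout for that grouping; the constant names come from this PR, but the property keys and defaults shown are placeholders, not the real values:

   ```java
   //////////////////////////////////////////////////////////
   // CDC / CarbonStreamer properties (grouped together)
   //////////////////////////////////////////////////////////

   // placeholder keys and defaults, for illustration only
   public static final String CARBON_ENABLE_SCHEMA_ENFORCEMENT =
       "carbon.streamer.enable.schema.enforcement";
   public static final String CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT = "true";

   public static final String CARBON_STREAMER_INSERT_DEDUPLICATE =
       "carbon.streamer.insert.deduplicate";
   public static final String CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT = "false";

   public static final String CARBON_STREAMER_UPSERT_DEDUPLICATE =
       "carbon.streamer.upsert.deduplicate";
   public static final String CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT = "true";
   ```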

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {

Review comment:
       This check is not needed. Can move the sourceField variable before this check and replace it with sourceField.isEmpty.
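
   A sketch of that restructuring against the quoted lines:

   ```scala
   // find the source field once and key both checks off the same Option
   val sourceField = sourceSchema.fields.find(_.name.equalsIgnoreCase(tgtField.name))
   if (sourceField.isEmpty) {
     val errorMsg = s"source schema does not contain field: ${tgtField.name}"
     LOGGER.error(errorMsg)
     throw new CarbonSchemaException(errorMsg)
   }
   ```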

+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {

Review comment:
       The same code is present in the verifySourceAndTargetSchemas method. Can move this code to a common method and reuse it.
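   A possible shape (a sketch; the helper name is hypothetical and the exception type is simplified to keep the snippet self-contained):

```scala
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.types.StructType

// Hypothetical shared helper; both verifySourceAndTargetSchemas and
// verifyBackwardsCompatibility would delegate the per-field checks to it.
def verifyFieldsExistWithSameType(sourceSchema: StructType,
    targetSchema: StructType): Unit = {
  targetSchema.fields.foreach { tgtField =>
    sourceSchema.fields.find(_.name.equalsIgnoreCase(tgtField.name)) match {
      case None =>
        throw new IllegalArgumentException(
          s"source schema does not contain field: ${tgtField.name}")
      case Some(srcField) if !srcField.dataType.equals(tgtField.dataType) =>
        throw new IllegalArgumentException(
          s"source schema has different data type for field: ${tgtField.name}")
      case _ => // present with the same type, nothing to do
    }
  }
}

def verifyBackwardsCompatibility(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
  verifyFieldsExistWithSameType(srcDs.schema, targetDs.schema)
}
```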

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -16,31 +16,41 @@
  */
 package org.apache.spark.sql.execution.command.mutation.merge
 
+import java.nio.charset.Charset
 import java.util
 
 import scala.collection.JavaConverters._
 import scala.collection.mutable
 
+import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{CarbonDatasourceHadoopRelation, Dataset, Row, SparkSession}
-import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.carbondata.execution.datasources.CarbonSparkDataSourceUtil
+import org.apache.spark.sql.catalyst.{CarbonParserUtil, TableIdentifier}
 import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
 import org.apache.spark.sql.catalyst.expressions.EqualTo
 import org.apache.spark.sql.execution.CastExpressionOptimization
+import org.apache.spark.sql.execution.command.{AlterTableAddColumnsModel, AlterTableDataTypeChangeModel, AlterTableDropColumnModel}
+import org.apache.spark.sql.execution.command.schema.{CarbonAlterTableAddColumnCommand, CarbonAlterTableColRenameDataTypeChangeCommand, CarbonAlterTableDropColumnCommand}
+import org.apache.spark.sql.functions.expr
 import org.apache.spark.sql.optimizer.CarbonFilters
-import org.apache.spark.sql.types.DateType
+import org.apache.spark.sql.parser.CarbonSpark2SqlParser
+import org.apache.spark.sql.types.{DateType, DecimalType, StructField}
 
+import org.apache.carbondata.common.exceptions.sql.{CarbonSchemaException, MalformedCarbonCommandException}

Review comment:
       Please remove the unused import: MalformedCarbonCommandException

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")

Review comment:
       Can move this to the previous line.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))

Review comment:
       tgtField.name.toLowerCase - converting to lowercase is not needed here, since an equalsIgnoreCase check is already used

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,

Review comment:
       The same code is available in DDLHelper.addColumns. Can move the common code to a new method and reuse it.
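   A possible shape for the shared builder (a sketch that reuses only calls already present in this diff; the helper name and its placement are hypothetical):

```scala
import org.apache.spark.sql.catalyst.CarbonParserUtil
import org.apache.spark.sql.execution.command.AlterTableAddColumnsModel
import org.apache.spark.sql.parser.CarbonSpark2SqlParser
import org.apache.spark.sql.types.StructField

import org.apache.carbondata.core.metadata.schema.table.CarbonTable

// Hypothetical common builder that DDLHelper.addColumns and
// handleAddColumnScenario could both call.
def buildAddColumnsModel(table: CarbonTable,
    colsToAdd: Seq[StructField]): AlterTableAddColumnsModel = {
  val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
  val dbName = CarbonParserUtil.convertDbNameToLowerCase(Option(table.getDatabaseName))
  val tableName = table.getTableName.toLowerCase
  val tableModel = CarbonParserUtil.prepareTableModel(
    ifNotExistPresent = false,
    dbName,
    tableName,
    fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
    Seq.empty,
    scala.collection.mutable.Map.empty[String, String],
    None,
    isAlterFlow = true)
  AlterTableAddColumnsModel(
    dbName,
    tableName,
    Map.empty[String, String],
    tableModel.dimCols,
    tableModel.msrCols,
    tableModel.highCardinalityDims.getOrElse(Seq.empty))
}
```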

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)

Review comment:
       Please add braces and format the code here.
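   Something along these lines (the trailing filterNot is an extra suggestion, since "".split(",") returns a single empty string, which would make the length guard always pass):

```scala
// same expression with braces; empty names are filtered out so the
// nonEmpty guard stays meaningful when the property is unset
val metaCols = metaColumnsString.split(",").map(_.trim).filterNot(_.isEmpty)
val srcDsWithoutMeta = if (metaCols.nonEmpty) {
  srcDs.drop(metaCols: _*)
} else {
  srcDs
}
```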

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,

Review comment:
       ```suggestion
           sourceSchema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
   ```
   i.e. use the already-bound `sourceSchema` instead of `srcDs.schema`; after this, the call can be formatted to a single line.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>

Review comment:
       Please format this code (space before the brace, body on its own lines).
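   An illustrative shape only; the map body is truncated in this hunk, so the pairing and reduce logic below are placeholders, not the real dedup code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Formatting sketch for the RDD pipeline.
def dedupeByKey(rdd: RDD[Row], keyIndex: Int): RDD[Row] = {
  rdd
    .map { row =>
      (row.get(keyIndex), row)
    }
    .reduceByKey { (left, right) =>
      left // placeholder: real code keeps the row with the latest ordering field
    }
    .map(_._2)
}
```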

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {

Review comment:
       The same code is available in DDLHelper.changeColumn. Move the common code into a new method and reuse it in both places.
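
       A minimal sketch of the suggested extraction, assuming the duplicated
       snippet is the DecimalType precision/scale handling around
       CarbonParserUtil.parseColumn; the helper name extractDataTypeInfo and its
       home are hypothetical, and the DataTypeInfo return type is inferred from
       the quoted hunk:

           import org.apache.spark.sql.types.{DataType, DecimalType}

           // Hypothetical shared helper so that DDLHelper.changeColumn and
           // handleDataTypeChangeScenario no longer duplicate this logic.
           def extractDataTypeInfo(columnName: String, dataType: DataType): DataTypeInfo = {
             // Only decimals carry extra (precision, scale) values.
             val values = dataType match {
               case d: DecimalType => Some(List((d.precision, d.scale)))
               case _ => None
             }
             CarbonParserUtil.parseColumn(columnName, dataType, values)
           }

       Both call sites would then reduce to a single
       extractDataTypeInfo(col.name, col.dataType) call.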
   

##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala
##########
@@ -847,20 +820,222 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
         Row("j", 2, "RUSSIA"), Row("k", 0, "INDIA")))
   }
 
-  test("test all the merge APIs UPDATE, DELETE, UPSERT and INSERT") {
+  def prepareTarget(
+      isPartitioned: Boolean = false,
+      partitionedColumn: String = null
+  ): Dataset[Row] = {
     sql("drop table if exists target")
-    val initframe = sqlContext.sparkSession.createDataFrame(Seq(
+    val initFrame = sqlContext.sparkSession.createDataFrame(Seq(
       Row("a", "0"),
       Row("b", "1"),
       Row("c", "2"),
       Row("d", "3")
     ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
-    initframe.write
-      .format("carbondata")
-      .option("tableName", "target")
-      .mode(SaveMode.Overwrite)
-      .save()
-    val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+
+    if (isPartitioned) {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .option("partitionColumns", partitionedColumn)
+        .mode(SaveMode.Overwrite)
+        .save()
+    } else {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .mode(SaveMode.Overwrite)
+        .save()
+    }
+    sqlContext.read.format("carbondata").option("tableName", "target").load()
+  }
+
+  def prepareTargetWithThreeFields(
+      isPartitioned: Boolean = false,
+      partitionedColumn: String = null
+  ): Dataset[Row] = {
+    sql("drop table if exists target")
+    val initFrame = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 0, "CHINA"),
+      Row("b", 1, "INDIA"),
+      Row("c", 2, "INDIA"),
+      Row("d", 3, "US")
+    ).asJava,
+      StructType(Seq(StructField("key", StringType),
+        StructField("value", IntegerType),
+        StructField("country", StringType))))
+
+    if (isPartitioned) {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .option("partitionColumns", partitionedColumn)
+        .mode(SaveMode.Overwrite)
+        .save()
+    } else {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .mode(SaveMode.Overwrite)
+        .save()
+    }
+    sqlContext.read.format("carbondata").option("tableName", "target").load()
+  }
+
+  test("test schema enforcement") {
+    val target = prepareTarget()
+    var cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", "1", "ab"),
+      Row("d", "4", "de")
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", StringType)
+      , StructField("new_value", StringType))))
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT, "true"
+    )
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", "1"), Row("b", "1"), Row("c", "2"), Row("d", "4")))
+
+    properties.addProperty(
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "true"
+    )
+
+    val exceptionCaught1 = intercept[MalformedCarbonCommandException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1, "ab"),
+        Row("d", 4, "de")
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", IntegerType)
+        , StructField("new_value", StringType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+    assert(exceptionCaught1.getMessage
+      .contains(
+        "property CARBON_STREAMER_INSERT_DEDUPLICATE should " +
+        "only be set with operation type INSERT"))
+
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    val exceptionCaught2 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1),
+        Row("d", 4)
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("val", IntegerType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+    assert(exceptionCaught2.getMessage.contains("source schema does not contain field: value"))
+
+    val exceptionCaught3 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1),
+        Row("d", 4)
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", LongType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+
+    assert(exceptionCaught3.getMsg.contains("source schema has different " +
+                                            "data type for field: value"))
+
+    val exceptionCaught4 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", "1", "A"),
+        Row("d", "4", "D")
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", StringType), StructField("Key", StringType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+
+    assert(exceptionCaught4.getMsg.contains("source schema has similar fields which " +
+                                            "differ only in case sensitivity: key"))
+  }
+
+  test("test schema evolution") {
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD, "value"
+    )
+    sql("drop table if exists target")
+    var target = prepareTargetWithThreeFields()
+    var cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 1, "ab", "china"),
+      Row("d", 4, "de", "china"),
+      Row("d", 7, "updated_de", "china_pro")
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", IntegerType)
+      , StructField("new_value", StringType),
+      StructField("country", StringType))))
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", 1, "china", "ab"), Row("b", 1, "INDIA", null),
+        Row("c", 2, "INDIA", null), Row("d", 7, "china_pro", "updated_de")))
+
+    target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+
+    cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 5),
+      Row("d", 5)
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", IntegerType))))
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", 5), Row("b", 1),
+        Row("c", 2), Row("d", 5)))
+
+//    target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+//    cdc = sqlContext.sparkSession.createDataFrame(Seq(
+//      Row("b", 50),
+//      Row("d", 50)

Review comment:
       Please remove this commented-out code if it is not needed.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,

Review comment:
       Remove this commented-out code.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()

Review comment:
       remove this line
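
       For reference, these are the two flags toggled in the tests quoted
       above; deduplicateBeforeWriting reads them through CarbonProperties.
       A minimal configuration sketch (the values are illustrative):

           val props = CarbonProperties.getInstance()
           // filter incoming records whose keys already exist in the target (INSERT flow)
           props.addProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false")
           // keep only the latest record per key within the incoming batch,
           // picked by the configured ordering field (UPSERT flow)
           props.addProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE, "true")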







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r731959885



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       You mean I should have a single if check and group all the new code under it, right?
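
       For context, a rough sketch of the restructuring being discussed,
       assuming the new property reads and schema checks can sit behind a
       single guard; the variable names are placeholders, and the method names
       follow the utilities quoted earlier in this thread:

           val schemaEnforcementEnabled = CarbonProperties.getInstance()
             .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
               CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
           if (schemaEnforcementEnabled) {
             // enforcement on: only verify, never mutate the target schema
             CarbonMergeDataSetUtil.verifySourceAndTargetSchemas(targetDs, srcDs)
           } else {
             // enforcement off: evolve the target schema to match the source
             CarbonMergeDataSetUtil.handleSchemaEvolution(targetDs, srcDs, sparkSession)
           }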







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732178451



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {

Review comment:
       Thank you for pointing this out.







[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948410531


   retest this please





[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732500910



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       Right. Got it.








[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948541781


   retest this please





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948070102


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/466/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949377854


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4345/
   





[GitHub] [carbondata] kunal642 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
kunal642 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734935130



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)

Review comment:
       Try to combine the chained map/filter calls into a single collect.
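
       A sketch of the single-pass variant, with the same behavior as the
       quoted groupBy/map/filter chain:

           // collect only the lower-cased names that occur more than once
           val similarFields = lowerCaseSrcSchemaFields
             .groupBy(identity)
             .collect { case (name, occurrences) if occurrences.length > 1 => name }
             .toList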







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-966931266


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6114/
   





[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732187965



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {

Review comment:
       done.







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732530921



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))

Review comment:
       Ok, but do you see any difference in terms of execution or performance with this? @akashrn5
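
       For context, a small sketch of the hoisted variant being compared here
       (same sourceSchema/targetSchema as in the diff): the target field names
       are computed once instead of being re-mapped for every source field.

           // build the lookup array a single time
           val targetFieldNames = targetSchema.fields.map(_.name)
           val addedColumns = sourceSchema.fields
             .map(_.name)
             .filterNot(f => targetFieldNames.contains(f))

       The contains scan itself is unchanged, but the per-element map over
       targetSchema.fields no longer runs inside the filterNot closure.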







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949514557


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4346/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948647648


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6084/
   





[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r736171916



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -2681,4 +2681,61 @@ private CarbonCommonConstants() {
 
   public static final String CARBON_CDC_MINMAX_PRUNING_ENABLED_DEFAULT = "false";
 
+  //////////////////////////////////////////////////////////////////////////////////////////
+  // CDC streamer configs start here
+  //////////////////////////////////////////////////////////////////////////////////////////
+
+  /**
+   * Name of the field from source schema whose value can be used for picking the latest updates for
+   * a particular record in the incoming batch in case of duplicates record keys. Useful if the
+   * write operation type is UPDATE or UPSERT. This will be used only if
+   * carbon.streamer.upsert.deduplicate is enabled.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_SOURCE_ORDERING_FIELD =
+      "carbon.streamer.source.ordering.field";
+
+  public static final String CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT = "";
+
+  /**
+   * This property specifies if the incoming batch needs to be deduplicated in case of INSERT
+   * operation type. If set to true, the incoming batch will be deduplicated against the existing
+   * data in the target carbondata table.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_INSERT_DEDUPLICATE =
+      "carbon.streamer.insert.deduplicate";
+
+  public static final String CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT = "false";
+
+  /**
+   * This property specifies if the incoming batch needs to be deduplicated (when multiple updates
+   * for the same record key are present in the incoming batch) in case of UPSERT/UPDATE operation
+   * type. If set to true, the user needs to provide proper value for the source ordering field as
+   * well.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_UPSERT_DEDUPLICATE =
+      "carbon.streamer.upsert.deduplicate";
+
+  public static final String CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT = "true";
+
+  /**
+   * The metadata columns coming from the source stream data, which should not be included in the
+   * target data.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_META_COLUMNS = "carbon.streamer.meta.columns";
+
+  /**
+   * This flag decides if table schema needs to change as per the incoming batch schema.
+   * If set to true, incoming schema will be validated with existing table schema.
+   * If the schema has evolved, the incoming batch cannot be ingested and
+   * job will simply fail.
+   */
+  @CarbonProperty
+  public static final String CARBON_ENABLE_SCHEMA_ENFORCEMENT = "carbon.enable.schema.enforcement";

Review comment:
       Yeah, we have a few more properties here - https://github.com/apache/carbondata/pull/4235. I will add all the properties together once both PRs get merged. Created a JIRA for the same - https://issues.apache.org/jira/browse/CARBONDATA-4308
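
       Until that documentation lands, a hedged usage sketch (the property keys
       are taken from this diff; the ordering field name event_ts is hypothetical):

           // illustrative only: enable upsert deduplication for the streamer
           val props = CarbonProperties.getInstance()
           props.addProperty("carbon.streamer.upsert.deduplicate", "true")
           // hypothetical source column used to pick the latest update per key
           props.addProperty("carbon.streamer.source.ordering.field", "event_ts")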







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-934538663


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4256/
   





[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r731965589



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    if (operationType != null &&
+        !MergeOperationType.withName(operationType.toUpperCase).equals(MergeOperationType.INSERT) &&
+        filterDupes) {
+      throw new MalformedCarbonCommandException("property CARBON_STREAMER_INSERT_DEDUPLICATE" +
+                                                " should only be set with operation type INSERT")
+    }
+    val isSchemaEnforcementEnabled = properties
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (operationType != null) {
+      if (isSchemaEnforcementEnabled) {
+        // call the util function to verify if incoming schema matches with target schema
+        CarbonMergeDataSetUtil.verifySourceAndTargetSchemas(targetDsOri, srcDS)
+      } else {
+        CarbonMergeDataSetUtil.handleSchemaEvolution(
+          targetDsOri, srcDS, sparkSession)
+      }
+    }
+
     // Target dataset must be backed by carbondata table.
-    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val tgtTable = relations.head.carbonRelation.carbonTable
+    val targetCarbonTable: CarbonTable = CarbonEnv.getCarbonTable(Option(tgtTable.getDatabaseName),

Review comment:
       This is added to handle the cases where the target table schema needs to be evolved. If a new column gets added, we want the updated target schema to be used from then on so that values for the new column are populated without any issues. We call the handleSchemaEvolution() method just before this line of code; a sketch follows. Hope that makes it clear. @Indhumathi27
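
       For readers following the thread, the re-resolution referred to here is
       the two lines in the diff above; as a sketch:

           // re-fetch the table through CarbonEnv so columns added by
           // handleSchemaEvolution() are visible in the schema used from here on
           val tgtTable = relations.head.carbonRelation.carbonTable
           val targetCarbonTable: CarbonTable = CarbonEnv.getCarbonTable(
             Option(tgtTable.getDatabaseName), tgtTable.getTableName)(sparkSession)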







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732228199



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    if (isPrimitiveAndNotDate) {
+      Comparator.getComparator(orderingFieldDataType)
+    } else if (orderingFieldDataType == DataTypes.STRING) {
+      null
+    } else {
+      Comparator.getComparatorByDataTypeForMeasure(orderingFieldDataType)
+    }
+  }
+
+  def getRowKey(
+      row: Row,
+      index: Integer,
+      carbonKeyColumn: CarbonColumn,
+      isPrimitiveAndNotDate: Boolean,
+      keyColumnDataType: CarbonDataType
+  ): AnyRef = {
+    if (!row.isNullAt(index)) {
+      row.getAs(index).toString
+    } else {
+      val value: Long = 0
+      if (carbonKeyColumn.isDimension) {
+        if (isPrimitiveAndNotDate) {
+          CarbonCommonConstants.EMPTY_BYTE_ARRAY
+        } else {
+          CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
+        }
+      } else {
+        val nullValueForMeasure = if ((keyColumnDataType eq DataTypes.BOOLEAN) ||

Review comment:
       There is a condition -> DataTypes.isDecimal(keyColumnDataType) at the end here. How do I take care of that with a case match? @Indhumathi27
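
       One way to fold such a predicate into a match is a pattern guard; a hedged
       sketch follows (the branch results are illustrative, not the actual
       null-value mapping):

           val nullValueForMeasure: Any = keyColumnDataType match {
             case DataTypes.BOOLEAN => false
             case DataTypes.SHORT => value.toShort
             case DataTypes.INT => value.toInt
             case DataTypes.LONG | DataTypes.TIMESTAMP => value
             // a guard clause covers the predicate-based decimal check
             case dt if DataTypes.isDecimal(dt) => null
             case _ => value.toDouble
           }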







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-932025406


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/384/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-931493597


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/375/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933386598


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/392/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-932136162


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5985/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-938549782


   Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/425/
   





[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r733768604



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       Why can't the filterDupes and isSchemaEnforcementEnabled reads be moved inside the if (operationType != null) check?
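
       A sketch of the restructuring this asks about (the same property reads as
       in the diff, just deferred until operationType is known to be non-null):

           if (operationType != null) {
             val properties = CarbonProperties.getInstance()
             val filterDupes = properties
               .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
                 CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
             val isSchemaEnforcementEnabled = properties
               .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
                 CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
             // dedupe validation and schema enforcement/evolution as in the diff
           }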







[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r736168790



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,365 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val tgtSchemaInLowerCase = targetSchema.fields.map(_.name.toLowerCase)
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          tgtSchemaInLowerCase.contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+  }
+
+  def verifyCaseSensitiveFieldNames(
+      lowerCaseSrcSchemaFields: Array[String]
+  ): Unit = {
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    verifyCaseSensitiveFieldNames(lowerCaseSrcSchemaFields)
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq,
+        sparkSession,
+        targetCarbonTable)
+    }
+
+    // check if any column got deleted from source
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) {
+      partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList
+    } else {
+      List[String]()
+    }
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession, targetCarbonTable)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs,
+        modifiedColumns.toList,
+        sparkSession,
+        targetCarbonTable)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession,
+      targetCarbonTable: CarbonTable): Unit = {
+    val alterTableAddColsCmd = DDLHelper.prepareAlterTableAddColsCommand(
+      Option(targetCarbonTable.getDatabaseName),
+      colsToAdd,
+      targetCarbonTable.getTableName.toLowerCase)
+    alterTableAddColsCmd.run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession,
+      targetCarbonTable: CarbonTable): Unit = {
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession,
+      targetCarbonTable: CarbonTable): Unit = {
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val alterTableColRenameDataTypeChangeCommand = DDLHelper
+        .prepareAlterTableColRenameDataTypeChangeCommand(
+        col,
+        Option(targetCarbonTable.getDatabaseName.toLowerCase),
+        targetCarbonTable.getTableName.toLowerCase,
+        col.name.toLowerCase,
+        isColumnRename = false,
+        Option.empty)
+      alterTableColRenameDataTypeChangeCommand.run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = Comparator.getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map { row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      var orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      var orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingValue1 == null) {
+        row2
+      } else if (orderingValue2 == null) {
+        row1
+      } else {
+        if (orderingFieldDataType.equals(DataTypes.STRING)) {
+          orderingValue1 = orderingValue1.toString
+            .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))
+          orderingValue2 = orderingValue2.toString
+            .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))
+        }
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getRowKey(
+      row: Row,
+      index: Integer,
+      carbonKeyColumn: CarbonColumn,
+      isPrimitiveAndNotDate: Boolean,
+      keyColumnDataType: CarbonDataType
+  ): AnyRef = {
+    if (!row.isNullAt(index)) {
+      row.getAs(index).toString
+    } else {
+      val value: Long = 0
+      if (carbonKeyColumn.isDimension) {
+        if (isPrimitiveAndNotDate) {
+          CarbonCommonConstants.EMPTY_BYTE_ARRAY
+        } else {
+          CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
+        }
+      } else {
+        val nullValueForMeasure = keyColumnDataType match {
+          case DataTypes.BOOLEAN | DataTypes.BYTE => value.toByte
+          case DataTypes.SHORT => value.toShort
+          case DataTypes.INT => value.toInt
+          case DataTypes.DOUBLE => 0d
+          case DataTypes.FLOAT => 0f
+          case DataTypes.LONG | DataTypes.TIMESTAMP => value
+          case _ => value
+        }
+        CarbonUtil.getValueAsBytes(keyColumnDataType, nullValueForMeasure)
+      }
+    }
+  }
+
+  def getCarbonDataType(
+      fieldName: String,
+      srcDs: Dataset[Row]
+  ): CarbonDataType = {
+    val schema = srcDs.schema
+    val dataType = schema.fields.find(f => f.name.equalsIgnoreCase(fieldName)).get.dataType
+    CarbonSparkDataSourceUtil.convertSparkToCarbonDataType(dataType)
+  }
+
+  def deduplicateAgainstExistingDataset(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String
+  ) : Dataset[Row] = {

Review comment:
       It looks like in some places the CarbonData code style format is not followed. Please check and format the newly added code.
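
A side note on what the new code above is doing: stripped of Carbon's type handling, the reduceByKey in deduplicateAgainstIncomingDataset keeps, per record key, the row with the greatest ordering value. Below is a minimal self-contained sketch of the same idea, assuming plain (key, orderingValue, payload) tuples and a Long timestamp standing in for Carbon's comparator:

```scala
import org.apache.spark.sql.SparkSession

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dedup-sketch").getOrCreate()
    // (key, orderingValue, payload) triples standing in for incoming rows
    val incoming = spark.sparkContext.parallelize(Seq(
      ("k1", 1L, "old"), ("k1", 3L, "new"), ("k2", 2L, "only")))
    val latestPerKey = incoming
      .map(r => (r._1, r))                               // key by record key
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b) // keep the greater ordering value
      .map(_._2)
    latestPerKey.collect().foreach(println)              // (k1,3,new), (k2,2,only)
    spark.stop()
  }
}
```

The >= mirrors the comparator branch in the PR: on a tie, the first row wins.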







[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r736169187



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -2681,4 +2681,61 @@ private CarbonCommonConstants() {
 
   public static final String CARBON_CDC_MINMAX_PRUNING_ENABLED_DEFAULT = "false";
 
+  //////////////////////////////////////////////////////////////////////////////////////////
+  // CDC streamer configs start here
+  //////////////////////////////////////////////////////////////////////////////////////////
+
+  /**
+   * Name of the field from source schema whose value can be used for picking the latest updates for
+   * a particular record in the incoming batch in case of duplicates record keys. Useful if the
+   * write operation type is UPDATE or UPSERT. This will be used only if
+   * carbon.streamer.upsert.deduplicate is enabled.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_SOURCE_ORDERING_FIELD =
+      "carbon.streamer.source.ordering.field";
+
+  public static final String CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT = "";
+
+  /**
+   * This property specifies if the incoming batch needs to be deduplicated in case of INSERT
+   * operation type. If set to true, the incoming batch will be deduplicated against the existing
+   * data in the target carbondata table.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_INSERT_DEDUPLICATE =
+      "carbon.streamer.insert.deduplicate";
+
+  public static final String CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT = "false";
+
+  /**
+   * This property specifies if the incoming batch needs to be deduplicated (when multiple updates
+   * for the same record key are present in the incoming batch) in case of UPSERT/UPDATE operation
+   * type. If set to true, the user needs to provide proper value for the source ordering field as
+   * well.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_UPSERT_DEDUPLICATE =
+      "carbon.streamer.upsert.deduplicate";
+
+  public static final String CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT = "true";
+
+  /**
+   * The metadata columns coming from the source stream data, which should not be included in the
+   * target data.
+   */
+  @CarbonProperty
+  public static final String CARBON_STREAMER_META_COLUMNS = "carbon.streamer.meta.columns";
+
+  /**
+   * This flag decides if table schema needs to change as per the incoming batch schema.
+   * If set to true, incoming schema will be validated with existing table schema.
+   * If the schema has evolved, the incoming batch cannot be ingested and
+   * job will simply fail.
+   */
+  @CarbonProperty
+  public static final String CARBON_ENABLE_SCHEMA_ENFORCEMENT = "carbon.enable.schema.enforcement";

Review comment:
       Please update the documentation for the newly added carbon properties.
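
Until the documentation is updated, here is a hypothetical snippet of how a user would set these properties before starting the streamer; the keys are the constants defined above, while the values are illustrative only:

```scala
import org.apache.carbondata.core.util.CarbonProperties

val props = CarbonProperties.getInstance()
// illustrative values; the keys are the newly added streamer constants
props.addProperty("carbon.streamer.source.ordering.field", "_commit_ts")
props.addProperty("carbon.streamer.upsert.deduplicate", "true")
props.addProperty("carbon.streamer.insert.deduplicate", "false")
props.addProperty("carbon.enable.schema.enforcement", "false")
```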







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732501440



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       Right. Got it.
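
For readers following along: the line commented on above excludes partition columns from the drop list, because a partition column missing from the incoming schema must not be dropped from the target table. The filter reduces to this sketch on plain collections (column names are made up):

```scala
val targetCols    = Seq("id", "name", "dept")   // "dept" is the partition column
val sourceCols    = Seq("id", "name")           // incoming batch no longer carries "dept"
val partitionCols = Seq("dept")

val lowerSourceCols = sourceCols.map(_.toLowerCase)
val deletedColumns = targetCols
  .map(_.toLowerCase)
  .filterNot(c => lowerSourceCols.contains(c) || partitionCols.contains(c))
// deletedColumns.isEmpty: the partition column is protected from being dropped
```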







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-947558263


   Build Failed with Spark 3.1. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/457/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-947542253


   Build Failed with Spark 2.3.4. Please check CI: http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6067/
   





[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734940636



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))

Review comment:
       Right, let me handle that. It got missed earlier.
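
The fix agreed on here is to filter the StructFields directly instead of mapping to names first, so each added column keeps its data type for the subsequent ADD COLUMN command. A standalone sketch with plain Spark schemas:

```scscala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val targetSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
val sourceSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("city", StringType)))

// filter the fields themselves so the data type travels with the column name
val addedColumns = sourceSchema.fields
  .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
// addedColumns.map(_.name).toSeq == Seq("city")
```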







[GitHub] [carbondata] ydvpankaj99 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
ydvpankaj99 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-938436208


   retest this please





[GitHub] [carbondata] Indhumathi27 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-938361206


   retest this please





[GitHub] [carbondata] akashrn5 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732542483



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))

Review comment:
       @pratyakshsharma it's just cleaner and simpler.







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732180153



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {

Review comment:
       Thank you for pointing this out.
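
For reference, the duplicate-name check under discussion groups the lower-cased field names and keeps any name occurring more than once, since Carbon treats column names case-insensitively. A minimal sketch:

```scala
// made-up field names; "Name" and "name" collide once lower-cased
val sourceFieldNames = Seq("id", "Name", "name", "city")

val similarFields = sourceFieldNames
  .map(_.toLowerCase)
  .groupBy(identity)
  .collect { case (field, occurrences) if occurrences.size > 1 => field }
  .toList
// similarFields == List("name"), which would trigger the CarbonSchemaException
```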







[GitHub] [carbondata] akashrn5 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r728168954



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))

Review comment:
       I think we can replace lines 614 to 616 with the code below:
   `val addedColumns = sourceSchema.fields
         .filterNot(f => targetSchema.fields.map(_.name).contains(f.name))`

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))

Review comment:
       I think we can replace lines 614 to 616 with the code below:
   `val addedColumns = sourceSchema.fields
         .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))`

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {

Review comment:
       Can you please add a comment here explaining why this code needs to be called only for the streamer tool?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-953882673


   Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/503/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-953851879


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6113/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-952023163


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4355/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r733763672



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,357 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableAddColsCmd = DDLHelper.prepareAlterTableAddColsCommand(
+      Option(targetCarbonTable.getDatabaseName),
+      colsToAdd,
+      targetCarbonTable.getTableName.toLowerCase)
+    alterTableAddColsCmd.run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val alterTableColRenameDataTypeChangeCommand = DDLHelper
+        .prepareAlterTableColRenameDataTypeChangeCommand(
+        col,
+        Option(targetCarbonTable.getDatabaseName.toLowerCase),
+        targetCarbonTable.getTableName.toLowerCase,
+        col.name.toLowerCase,
+        isColumnRename = false,
+        Option.empty)
+      alterTableColRenameDataTypeChangeCommand.run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = Comparator.getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map { row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE

Review comment:
       I feel this code segment can still be simplified, along these lines:
    if (orderingValue1 == null) {
      row2
    } else if (orderingValue2 == null) {
      row1
    } else {
      // for string columns, compare the byte representations
      // (assumes ByteUtil.toBytes(String); adjust to whatever conversion the PR uses)
      val (v1, v2) = if (orderingFieldDataType.equals(DataTypes.STRING)) {
        (ByteUtil.toBytes(orderingValue1.toString), ByteUtil.toBytes(orderingValue2.toString))
      } else {
        (orderingValue1, orderingValue2)
      }
      if (comparator.compare(v1, v2) >= 0) {
        row1
      } else {
        row2
      }
    }
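
       For context, a self-contained toy version of this keep-latest-per-key reduction (the session setup, DataFrame and column names here are illustrative, not the PR's API):

    import org.apache.spark.sql.SparkSession

    // Toy illustration: per key, keep the row with the greatest ordering value.
    val spark = SparkSession.builder().master("local[2]").appName("dedupe-sketch").getOrCreate()
    import spark.implicits._
    val src = Seq((1, 10L, "a"), (1, 20L, "b"), (2, 5L, "c")).toDF("id", "ts", "payload")
    val latest = src.rdd
      .map(r => (r.getInt(0), r))
      .reduceByKey((r1, r2) => if (r1.getLong(1) >= r2.getLong(1)) r1 else r2)
      .map(_._2)
    latest.collect().foreach(println)  // keeps [1,20,b] and [2,5,c]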




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r735254278



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,360 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val tgtSchemaInLowerCase = targetSchema.fields.map(_.name.toLowerCase)
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          tgtSchemaInLowerCase.contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+  }
+
+  def verifyCaseSensitiveFieldNames(
+      lowerCaseSrcSchemaFields: Array[String]
+  ): Unit = {
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    verifyCaseSensitiveFieldNames(lowerCaseSrcSchemaFields)
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala

Review comment:
       Please format the code with braces here.
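
       For example:

    val partitionColumns = if (partitionInfo != null) {
      partitionInfo.getColumnSchemaList.asScala.map(_.getColumnName).toList
    } else {
      List.empty[String]
    }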

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,360 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val tgtSchemaInLowerCase = targetSchema.fields.map(_.name.toLowerCase)
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          tgtSchemaInLowerCase.contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+  }
+
+  def verifyCaseSensitiveFieldNames(
+      lowerCaseSrcSchemaFields: Array[String]
+  ): Unit = {
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    verifyCaseSensitiveFieldNames(lowerCaseSrcSchemaFields)
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)

Review comment:
       Line 625, which collects the target dataframe relations, can be reused inside this method (handleDeleteColumnScenario) and in the handleDataTypeChangeScenario method as well.
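
       A sketch of one way to do that, passing the already-resolved table in instead of re-collecting the relation (the changed signature is illustrative):

    def handleDeleteColumnScenario(targetCarbonTable: CarbonTable, colsToDrop: List[String],
        sparkSession: SparkSession): Unit = {
      val alterTableDropColumnModel = AlterTableDropColumnModel(
        CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
        targetCarbonTable.getTableName.toLowerCase,
        colsToDrop.map(_.toLowerCase))
      CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
    }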

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,360 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val tgtSchemaInLowerCase = targetSchema.fields.map(_.name.toLowerCase)
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          tgtSchemaInLowerCase.contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+  }
+
+  def verifyCaseSensitiveFieldNames(
+      lowerCaseSrcSchemaFields: Array[String]
+  ): Unit = {
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))

Review comment:
       ```suggestion
           .find(f => f.name.equalsIgnoreCase(tgtField.name))
       ```

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,360 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val tgtSchemaInLowerCase = targetSchema.fields.map(_.name.toLowerCase)
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          tgtSchemaInLowerCase.contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+  }
+
+  def verifyCaseSensitiveFieldNames(
+      lowerCaseSrcSchemaFields: Array[String]
+  ): Unit = {
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    verifyCaseSensitiveFieldNames(lowerCaseSrcSchemaFields)
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,

Review comment:
       Line 625, which collects the target dataframe relations, can be reused inside this method as well; see the sketch above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-950860177


   @Indhumathi27 Please take a pass. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-951010235


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/484/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-934430307


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/401/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] Indhumathi27 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-940801143


   @pratyakshsharma Please add the PR description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-931515930


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4230/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933372025


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-938547832


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4290/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-930426200


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4219/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-930562190


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/366/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-952802588


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4364/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-953749234


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732501107



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       Right. Got it.
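
       To restate the takeaway with a tiny, made-up example (hypothetical column names, not
       from the PR itself): a partition column that is absent from the incoming batch must not
       be treated as a deleted column.

           // sketch only; the column names are hypothetical
           val targetFields = Seq("key", "value", "country") // "country" is the partition column
           val sourceFields = Seq("key", "value")
           val partitionColumns = Seq("country")
           val deletedColumns = targetFields
             .filterNot(f => sourceFields.contains(f) || partitionColumns.contains(f))
           // deletedColumns is empty: "country" is kept even though the source dropped it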




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732414629



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    if (isPrimitiveAndNotDate) {
+      Comparator.getComparator(orderingFieldDataType)
+    } else if (orderingFieldDataType == DataTypes.STRING) {
+      null
+    } else {
+      Comparator.getComparatorByDataTypeForMeasure(orderingFieldDataType)
+    }
+  }
+
+  def getRowKey(
+      row: Row,
+      index: Integer,
+      carbonKeyColumn: CarbonColumn,
+      isPrimitiveAndNotDate: Boolean,
+      keyColumnDataType: CarbonDataType
+  ): AnyRef = {
+    if (!row.isNullAt(index)) {
+      row.getAs(index).toString
+    } else {
+      val value: Long = 0
+      if (carbonKeyColumn.isDimension) {
+        if (isPrimitiveAndNotDate) {
+          CarbonCommonConstants.EMPTY_BYTE_ARRAY
+        } else {
+          CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
+        }
+      } else {
+        val nullValueForMeasure = if ((keyColumnDataType eq DataTypes.BOOLEAN) ||

Review comment:
       For Long, Timestamp and Decimal the default value is used, so a default case can be used here.
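
       For example, the remaining branches could collapse into something like this (a sketch
       only, assuming the Boolean/Short cases from the surrounding code; not the PR's final
       code):

           // hypothetical sketch: picking the null value for a measure key column
           val nullValueForMeasure: Any = keyColumnDataType match {
             case DataTypes.BOOLEAN => false
             case DataTypes.SHORT => 0.toShort
             case _ => 0L // Long, Timestamp and Decimal all fall back to the default
           }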




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r733699883



##########
File path: pom.xml
##########
@@ -469,6 +469,7 @@
           <findbugsXmlOutput>true</findbugsXmlOutput>
           <xmlOutput>true</xmlOutput>
           <effort>Max</effort>
+          <maxHeap>1024</maxHeap>

Review comment:
       Sure, will do.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732206905



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala
##########
@@ -847,20 +820,222 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
         Row("j", 2, "RUSSIA"), Row("k", 0, "INDIA")))
   }
 
-  test("test all the merge APIs UPDATE, DELETE, UPSERT and INSERT") {
+  def prepareTarget(
+      isPartitioned: Boolean = false,
+      partitionedColumn: String = null
+  ): Dataset[Row] = {
     sql("drop table if exists target")
-    val initframe = sqlContext.sparkSession.createDataFrame(Seq(
+    val initFrame = sqlContext.sparkSession.createDataFrame(Seq(
       Row("a", "0"),
       Row("b", "1"),
       Row("c", "2"),
       Row("d", "3")
     ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
-    initframe.write
-      .format("carbondata")
-      .option("tableName", "target")
-      .mode(SaveMode.Overwrite)
-      .save()
-    val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+
+    if (isPartitioned) {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .option("partitionColumns", partitionedColumn)
+        .mode(SaveMode.Overwrite)
+        .save()
+    } else {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .mode(SaveMode.Overwrite)
+        .save()
+    }
+    sqlContext.read.format("carbondata").option("tableName", "target").load()
+  }
+
+  def prepareTargetWithThreeFields(
+      isPartitioned: Boolean = false,
+      partitionedColumn: String = null
+  ): Dataset[Row] = {
+    sql("drop table if exists target")
+    val initFrame = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 0, "CHINA"),
+      Row("b", 1, "INDIA"),
+      Row("c", 2, "INDIA"),
+      Row("d", 3, "US")
+    ).asJava,
+      StructType(Seq(StructField("key", StringType),
+        StructField("value", IntegerType),
+        StructField("country", StringType))))
+
+    if (isPartitioned) {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .option("partitionColumns", partitionedColumn)
+        .mode(SaveMode.Overwrite)
+        .save()
+    } else {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .mode(SaveMode.Overwrite)
+        .save()
+    }
+    sqlContext.read.format("carbondata").option("tableName", "target").load()
+  }
+
+  test("test schema enforcement") {
+    val target = prepareTarget()
+    var cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", "1", "ab"),
+      Row("d", "4", "de")
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", StringType)
+      , StructField("new_value", StringType))))
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT, "true"
+    )
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", "1"), Row("b", "1"), Row("c", "2"), Row("d", "4")))
+
+    properties.addProperty(
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "true"
+    )
+
+    val exceptionCaught1 = intercept[MalformedCarbonCommandException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1, "ab"),
+        Row("d", 4, "de")
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", IntegerType)
+        , StructField("new_value", StringType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+    assert(exceptionCaught1.getMessage
+      .contains(
+        "property CARBON_STREAMER_INSERT_DEDUPLICATE should " +
+        "only be set with operation type INSERT"))
+
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    val exceptionCaught2 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1),
+        Row("d", 4)
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("val", IntegerType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+    assert(exceptionCaught2.getMessage.contains("source schema does not contain field: value"))
+
+    val exceptionCaught3 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1),
+        Row("d", 4)
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", LongType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+
+    assert(exceptionCaught3.getMsg.contains("source schema has different " +
+                                            "data type for field: value"))
+
+    val exceptionCaught4 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", "1", "A"),
+        Row("d", "4", "D")
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", StringType), StructField("Key", StringType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+
+    assert(exceptionCaught4.getMsg.contains("source schema has similar fields which " +
+                                            "differ only in case sensitivity: key"))
+  }
+
+  test("test schema evolution") {
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD, "value"
+    )
+    sql("drop table if exists target")
+    var target = prepareTargetWithThreeFields()
+    var cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 1, "ab", "china"),
+      Row("d", 4, "de", "china"),
+      Row("d", 7, "updated_de", "china_pro")
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", IntegerType)
+      , StructField("new_value", StringType),
+      StructField("country", StringType))))
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", 1, "china", "ab"), Row("b", 1, "INDIA", null),
+        Row("c", 2, "INDIA", null), Row("d", 7, "china_pro", "updated_de")))
+
+    target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+
+    cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 5),
+      Row("d", 5)
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", IntegerType))))
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", 5), Row("b", 1),
+        Row("c", 2), Row("d", 5)))
+
+//    target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+//    cdc = sqlContext.sparkSession.createDataFrame(Seq(
+//      Row("b", 50),
+//      Row("d", 50)

Review comment:
       Need to add a test case for data type change here. Will do that.
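       For example, something along these lines (a sketch only; the final test may look
       different):

           test("test schema evolution with data type change") {
             // reuse the evolved target table above and send "value" as LongType
             val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
             val cdc = sqlContext.sparkSession.createDataFrame(Seq(
               Row("b", 50L),
               Row("d", 50L)
             ).asJava, StructType(Seq(StructField("key", StringType),
               StructField("value", LongType))))
             target.as("A").upsert(cdc.as("B"), "key").execute()
             checkAnswer(sql("select * from target"),
               Seq(Row("a", 5L), Row("b", 50L), Row("c", 2L), Row("d", 50L)))
           }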




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948530655


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6078/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-966930068


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4371/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-968861503


   Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/508/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949357635


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6088/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] kunal642 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
kunal642 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r748068793



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)

Review comment:
       please handle this comment




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r748073264



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)

Review comment:
       Can you suggest something here?
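
       One direction that comes to mind (sketch only; intended to behave the same as the
       groupBy/identity version above):

           // names that occur more than once after lower-casing
           val similarFields = sourceSchema.fieldNames
             .groupBy(_.toLowerCase)
             .collect { case (name, duplicates) if duplicates.length > 1 => name }
             .toList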




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-966930676


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/504/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732500910



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       Right. Got it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948047670


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6076/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949390101


   Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/478/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949528094


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/479/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-950151008


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4348/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-950998625


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6094/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933479428


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5993/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-931297536


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/372/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933487352


   Build Success with Spark 2.4.5. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4247/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-952036516


   Build Success with Spark 3.1. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/488/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] ydvpankaj99 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
ydvpankaj99 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949289319


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949500764


   Build Success with Spark 2.3.4. Please check CI: http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6089/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] kunal642 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
kunal642 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734935378



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in a
+    backwards-compatible way. In phase 1, only the addition of columns is supported, hence this
+    check is needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+    existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))

Review comment:
       What if the field names differ only in case, as handled in "verifySourceAndTargetSchemas"?
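
       For illustration, a case-insensitive variant of that check might look like
       this (a sketch against Spark's StructType API, not the PR's final code):

           // Compare field names case-insensitively, mirroring
           // verifySourceAndTargetSchemas. Assumes sourceSchema and
           // targetSchema are the StructTypes used above.
           val targetNamesLower = targetSchema.fields.map(_.name.toLowerCase).toSet
           val addedColumns = sourceSchema.fields
             .filterNot(field => targetNamesLower.contains(field.name.toLowerCase))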




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] kunal642 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
kunal642 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-966922573


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-931500954


   Build Failed with Spark 2.3.4. Please check CI: http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5976/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732509973



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    if (isPrimitiveAndNotDate) {
+      Comparator.getComparator(orderingFieldDataType)
+    } else if (orderingFieldDataType == DataTypes.STRING) {
+      null
+    } else {
+      Comparator.getComparatorByDataTypeForMeasure(orderingFieldDataType)
+    }
+  }
+
+  def getRowKey(
+      row: Row,
+      index: Integer,
+      carbonKeyColumn: CarbonColumn,
+      isPrimitiveAndNotDate: Boolean,
+      keyColumnDataType: CarbonDataType
+  ): AnyRef = {
+    if (!row.isNullAt(index)) {
+      row.getAs(index).toString
+    } else {
+      val value: Long = 0
+      if (carbonKeyColumn.isDimension) {
+        if (isPrimitiveAndNotDate) {
+          CarbonCommonConstants.EMPTY_BYTE_ARRAY
+        } else {
+          CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
+        }
+      } else {
+        val nullValueForMeasure = if ((keyColumnDataType eq DataTypes.BOOLEAN) ||

Review comment:
       Ok, I had copied this code from another class. Let me change it anyway.
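
       For reference, one way to avoid the copy is a small shared helper; a
       minimal sketch follows (the helper name and the simplified measure branch
       are ours, not from the PR):

           // Centralise the default key used when the key column is null,
           // instead of duplicating this branching in every caller.
           def defaultKeyForNull(
               keyColumn: CarbonColumn,
               isPrimitiveAndNotDate: Boolean): AnyRef = {
             if (keyColumn.isDimension) {
               if (isPrimitiveAndNotDate) {
                 CarbonCommonConstants.EMPTY_BYTE_ARRAY
               } else {
                 CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
               }
             } else {
               // measures fall back to a typed zero; the per-type mapping
               // (BOOLEAN/BYTE/SHORT/...) from the branch above goes here
               java.lang.Long.valueOf(0L)
             }
           }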




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732511637



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&

Review comment:
       Comparator.getComparator() will handle the DATE type as well. Changing it.
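
       With that change, the method could collapse to something like this (a
       sketch assuming Comparator.getComparator covers DATE):

           def getComparator(
               orderingFieldDataType: CarbonDataType): SerializableComparator = {
             if (orderingFieldDataType == DataTypes.STRING) {
               // strings are compared via ByteUtil in the caller
               null
             } else if (DataTypeUtil.isPrimitiveColumn(orderingFieldDataType)) {
               Comparator.getComparator(orderingFieldDataType)
             } else {
               Comparator.getComparatorByDataTypeForMeasure(orderingFieldDataType)
             }
           }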




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948471342


   Build Failed with Spark 2.4.5. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4335/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948742086


   Build Success with Spark 2.4.5. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4341/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948857122


   Build Success with Spark 3.1. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/474/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-934535726


   Build Success with Spark 2.3.4. Please check CI: http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6002/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-930421671


   Build Failed with Spark 2.3.4. Please check CI: http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5964/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948410925


   @Indhumathi27 @akashrn5 please take a pass. I have addressed the review comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948513336


   Build Success with Spark 3.1. Please check CI: http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/468/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734244612



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,357 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableAddColsCmd = DDLHelper.prepareAlterTableAddColsCommand(
+      Option(targetCarbonTable.getDatabaseName),
+      colsToAdd,
+      targetCarbonTable.getTableName.toLowerCase)
+    alterTableAddColsCmd.run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val alterTableColRenameDataTypeChangeCommand = DDLHelper
+        .prepareAlterTableColRenameDataTypeChangeCommand(
+        col,
+        Option(targetCarbonTable.getDatabaseName.toLowerCase),
+        targetCarbonTable.getTableName.toLowerCase,
+        col.name.toLowerCase,
+        isColumnRename = false,
+        Option.empty)
+      alterTableColRenameDataTypeChangeCommand.run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = Comparator.getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map { row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE

Review comment:
       Yeah, actually this can be done. Thank you for the detailed review.
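
For illustration, a minimal sketch of the keep-latest-record-per-key pattern that deduplicateAgainstIncomingDataset implements via reduceByKey, assuming a non-null Long ordering column (DedupSketch and latestPerKey are illustrative names, not part of the PR; the actual code also handles Carbon data types, null ordering values, and byte-level string comparison via ByteUtil):

    import org.apache.spark.sql.{Dataset, Row}

    object DedupSketch {
      // Keep only the latest row per key, where "latest" means the largest
      // value of the ordering column (assumed here to be a non-null Long).
      def latestPerKey(src: Dataset[Row], keyColumn: String, orderingField: String): Dataset[Row] = {
        val deduped = src.rdd
          .map(row => (row.getAs[Any](keyColumn), row))  // pair each row with its record key
          .reduceByKey { (row1, row2) =>
            if (row1.getAs[Long](orderingField) >= row2.getAs[Long](orderingField)) row1 else row2
          }
          .map(_._2)  // drop the key, keep the surviving row
        src.sparkSession.createDataFrame(deduped, src.schema)
      }
    }

With this shape, each incoming micro-batch collapses to at most one row per key before the merge command runs.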







[GitHub] [carbondata] kunal642 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
kunal642 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734935031



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)

Review comment:
       "targetSchema.fields.map(_.name.toLowerCase)" this should be outside the filterNot loop otherwise it would be done on each iteration of lowerCaseSrcSchemaFields 







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732205221



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()

Review comment:
       I had added this for debugging and forgot to remove it. Thank you for pointing this out.







[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732207084



##########
File path: pom.xml
##########
@@ -130,7 +130,7 @@
     <scala.version>2.11.8</scala.version>
     <hadoop.deps.scope>compile</hadoop.deps.scope>
     <spark.version>2.3.4</spark.version>
-    <spark.binary.version>2.3</spark.binary.version>
+    <spark.binary.version>2.4</spark.binary.version>

Review comment:
       done.







[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r731976677



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       Yes, right.
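
The quoted diff line suggests this exchange is about reusing the CarbonProperties singleton; for illustration, the reuse pattern as it appears elsewhere in this PR, fetching the instance once before all getProperty lookups (the StreamerPropsSketch wrapper is illustrative):

    import org.apache.carbondata.core.constants.CarbonCommonConstants
    import org.apache.carbondata.core.util.CarbonProperties

    object StreamerPropsSketch {
      // Fetch the CarbonProperties instance once and reuse it for every lookup,
      // instead of calling getInstance() before each getProperty call.
      val properties: CarbonProperties = CarbonProperties.getInstance()
      val filterDupes: Boolean = properties
        .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
          CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
      val combineBeforeUpsert: Boolean = properties
        .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
          CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
    }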







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-950150080


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6091/
   





[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-940838616


   @Indhumathi27 Updated.





[GitHub] [carbondata] asfgit closed pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227


   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-952799337


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6107/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-952828288


   Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/497/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-938553883


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6035/
   





[GitHub] [carbondata] akashrn5 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r728170904



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {

Review comment:
       Can you please add a comment here explaining why this code needs to be called only for the streamer tool?







[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732413708



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       The conversion of sourceSchema fields to lowercase happens inside the filterNot() loop, so it is repeated once for every field in the target schema; hoisting it out of the loop avoids that repeated work.
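
For illustration, a minimal sketch of this suggestion applied to the deletedColumns computation, as a standalone helper (the signature is illustrative; the PR applies the same hoisting inline, and the Set here goes one step further by making each lookup O(1)):

    import org.apache.spark.sql.types.StructType

    // Lowercase the source field names once, outside the loop; filterNot then
    // performs a cheap Set lookup per target field instead of re-mapping the
    // whole source schema on every iteration.
    def deletedColumns(sourceSchema: StructType,
        targetSchema: StructType,
        partitionColumns: List[String]): Array[String] = {
      val srcFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase).toSet
      targetSchema.fields.map(_.name.toLowerCase)
        .filterNot(f => srcFieldsInLowerCase.contains(f) || partitionColumns.contains(f))
    }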







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-950994622


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4351/
   





[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-948053433


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4333/
   





[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r732190679



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       But it is getting used only once, @Indhumathi27. How do I reuse this?
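
   A minimal sketch of the extraction under discussion, with a hypothetical val name and assuming sourceSchema, targetSchema and partitionColumns as in the diff above; the lowercased field list is computed once instead of being rebuilt inside the predicate. A later revision of the PR does this with a val named srcSchemaFieldsInLowerCase:

```scala
// Hypothetical sketch: hoist the lowercased source field names into a val.
val srcFieldsLower = sourceSchema.fields.map(_.name.toLowerCase)
val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
  .filterNot(f => srcFieldsLower.contains(f) || partitionColumns.contains(f))
```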




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r733674567



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,370 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      }
+      else {

Review comment:
       please move this `else` to the previous line
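
   A minimal sketch of the layout being requested; a later revision of the PR adopts exactly this shape:

```scala
// Idiomatic Scala keeps `else` on the same line as the closing brace.
val srcDsWithoutMeta = if (metaCols.length > 0) {
  srcDs.drop(metaCols: _*)
} else {
  srcDs
}
```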

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,370 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +

Review comment:
       if additionalSourceFields is empty, there is no need to print logs, right? Please check and handle.
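
   A minimal sketch of the guard being asked for; a later revision of the PR adds this exact check:

```scala
// Only log when additional source fields actually exist.
if (additionalSourceFields.nonEmpty) {
  LOGGER.warn(s"source schema contains additional fields which are not " +
              s"present in target schema: ${additionalSourceFields.mkString(",")}")
}
```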

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       the if statement can be moved before line 104, which can avoid reading the properties

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,370 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      }
+      else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableAddColsCmd = DDLHelper.prepareAlterTableAddColsCommand(
+      Option(targetCarbonTable.getDatabaseName),
+      colsToAdd,
+      targetCarbonTable.getTableName.toLowerCase)
+    alterTableAddColsCmd.run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val alterTableColRenameDataTypeChangeCommand = DDLHelper
+        .prepareAlterTableColRenameDataTypeChangeCommand(
+        col,
+        Option(targetCarbonTable.getDatabaseName.toLowerCase),
+        targetCarbonTable.getTableName.toLowerCase,
+        col.name.toLowerCase,
+        isColumnRename = false,
+        Option.empty)
+      alterTableColRenameDataTypeChangeCommand.run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = Comparator.getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map { row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+//  def getComparator(
+//      orderingFieldDataType: CarbonDataType
+//  ): SerializableComparator = {
+//    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+//                                (orderingFieldDataType != DataTypes.DATE)

Review comment:
       please remove this commented code

##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala
##########
@@ -976,6 +1151,10 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
       Row("d", "3")
     ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
 
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(

Review comment:
       ```suggestion
      CarbonProperties.getInstance().addProperty(
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] jackylk commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
jackylk commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-968891741


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-947545320


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4324/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-949300314


   @Indhumathi27 Addressed this last comment as well. I need to add a test case, will do that and ping you here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-933258477


   Build Failed  with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/389/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-930567146


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5966/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-931374221


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5972/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] akashrn5 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r728168954



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))

Review comment:
       I think we can replace lines 614 to 616 with the code below:
   ```scala
   val addedColumns = sourceSchema.fields
     .filterNot(f => targetSchema.fields.map(_.name).contains(f.name))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-930568792


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4221/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-953866140


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4370/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-952008055


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6098/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r733695900



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       actually the if condition uses variables derived from reading the properties instance, so this is not possible :)
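
   A condensed sketch of the dependency, paraphrasing the diff above: the guard branches on filterDupes, which only exists after the properties instance has been read, so the if statement cannot be hoisted above it.

```scala
// filterDupes is derived from the properties read, so the guard must follow it.
val filterDupes = CarbonProperties.getInstance()
  .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
    CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
if (operationType != null && filterDupes &&
    !MergeOperationType.withName(operationType.toUpperCase)
      .equals(MergeOperationType.INSERT)) {
  throw new MalformedCarbonCommandException("property " +
    "CARBON_STREAMER_INSERT_DEDUPLICATE should only be set with operation type INSERT")
}
```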




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r733684951



##########
File path: pom.xml
##########
@@ -469,6 +469,7 @@
           <findbugsXmlOutput>true</findbugsXmlOutput>
           <xmlOutput>true</xmlOutput>
           <effort>Max</effort>
+          <maxHeap>1024</maxHeap>

Review comment:
       This code change has already been merged to the master branch. Please rebase.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734955345



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {

Review comment:
       yes, it is possible.
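
   To make the answer concrete, a hypothetical pair of schemas where both happen in one batch; with such input, handleSchemaEvolution would invoke handleDeleteColumnScenario for the dropped column and handleDataTypeChangeScenario for the retyped one in the same run:

```scala
import org.apache.spark.sql.types._

// Hypothetical schemas: "city" is deleted in source and "id" changes type.
val targetSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("city", StringType)))
val sourceSchema = StructType(Seq(
  StructField("id", LongType)))
```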




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] kunal642 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
kunal642 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r734935622



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +475,351 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    verifyBackwardsCompatibility(targetDs, srcDs)
+
+    val lowerCaseSrcSchemaFields = sourceSchema.fields.map(_.name.toLowerCase)
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = lowerCaseSrcSchemaFields
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      if (additionalSourceFields.nonEmpty) {
+        LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                    s"target schema: ${ additionalSourceFields.mkString(",") }")
+      }
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = lowerCaseSrcSchemaFields.groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      val errorMsg = s"source schema has similar fields which differ only in case sensitivity: " +
+                     s"${ similarFields.mkString(",") }"
+      LOGGER.error(errorMsg)
+      throw new CarbonSchemaException(errorMsg)
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) {
+        srcDs.drop(metaCols: _*)
+      } else {
+        srcDs
+      }
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      // check if some field is missing in source schema
+      if (sourceField.isEmpty) {
+        val errorMsg = s"source schema does not contain field: ${ tgtField.name }"
+        LOGGER.error(errorMsg)
+        throw new CarbonSchemaException(errorMsg)
+      }
+
+      // check if data type got modified for some column
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        val errorMsg = s"source schema has different data type " +
+                       s"for field: ${ tgtField.name }"
+        LOGGER.error(errorMsg + s", source type: ${ sourceField.get.dataType }, " +
+                     s"target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(errorMsg)
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    /*
+    If the method is called from CarbonStreamer, we need to ensure the schema is evolved in
+    backwards compatible way. In phase 1, only addition of columns is supported, hence this check is
+    needed to ensure data integrity.
+    The existing IUD flow supports full schema evolution, hence this check is not needed for
+     existing flows.
+     */
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        sourceSchema.fields.filter(f => addedColumns.contains(f)).toSeq, sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val srcSchemaFieldsInLowerCase = sourceSchema.fields.map(_.name.toLowerCase)
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        srcSchemaFieldsInLowerCase.contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {

Review comment:
       can sourceSchema have data type change and column deletion together?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-950156207


   Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/481/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] pratyakshsharma commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-968690929


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-968783866


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4375/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-968794161


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6118/
   





[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r728023802



##########
File path: common/src/main/java/org/apache/carbondata/common/exceptions/sql/CarbonSchemaException.java
##########
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.common.exceptions.sql;
+
+import org.apache.carbondata.common.annotations.InterfaceAudience;
+import org.apache.carbondata.common.annotations.InterfaceStability;
+
+@InterfaceAudience.User
+@InterfaceStability.Stable
+public class CarbonSchemaException extends Exception {
+
+  private static final long serialVersionUID = 1L;
+
+  private final String msg;
+
+  public CarbonSchemaException(String msg) {
+    super(msg);
+    this.msg = msg;
+  }
+
+  public String getMsg() {

Review comment:
       ```suggestion
     public String getMessage() {
   ```

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    if (operationType != null &&
+        !MergeOperationType.withName(operationType.toUpperCase).equals(MergeOperationType.INSERT) &&
+        filterDupes) {
+      throw new MalformedCarbonCommandException("property CARBON_STREAMER_INSERT_DEDUPLICATE" +
+                                                " should only be set with operation type INSERT")
+    }
+    val isSchemaEnforcementEnabled = properties
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (operationType != null) {
+      if (isSchemaEnforcementEnabled) {
+        // call the util function to verify if incoming schema matches with target schema
+        CarbonMergeDataSetUtil.verifySourceAndTargetSchemas(targetDsOri, srcDS)
+      } else {
+        CarbonMergeDataSetUtil.handleSchemaEvolution(
+          targetDsOri, srcDS, sparkSession)
+      }
+    }
+
     // Target dataset must be backed by carbondata table.
-    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val tgtTable = relations.head.carbonRelation.carbonTable
+    val targetCarbonTable: CarbonTable = CarbonEnv.getCarbonTable(Option(tgtTable.getDatabaseName),

Review comment:
       Why is this code required? `relations.head.carbonRelation.carbonTable` already provides the target table.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
##########
@@ -98,8 +99,35 @@ case class CarbonMergeDataSetCommand(
       throw new UnsupportedOperationException(
         "Carbon table supposed to be present in merge dataset")
     }
+
+    val properties = CarbonProperties.getInstance()

Review comment:
       It looks like the validations added here apply only when operationType is not null, so please move them inside the `if (operationType != null)` check.
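   For illustration, a minimal sketch of that restructuring, reusing only names already present in this diff (not a drop-in patch):

   ```scala
   // sketch: group all operationType-dependent validations under one null check
   if (operationType != null) {
     val properties = CarbonProperties.getInstance()
     val filterDupes = properties
       .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
         CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
     if (!MergeOperationType.withName(operationType.toUpperCase)
            .equals(MergeOperationType.INSERT) && filterDupes) {
       throw new MalformedCarbonCommandException("property CARBON_STREAMER_INSERT_DEDUPLICATE" +
                                                 " should only be set with operation type INSERT")
     }
     val isSchemaEnforcementEnabled = properties
       .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
         CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
     if (isSchemaEnforcementEnabled) {
       // verify that the incoming schema matches the target schema
       CarbonMergeDataSetUtil.verifySourceAndTargetSchemas(targetDsOri, srcDS)
     } else {
       CarbonMergeDataSetUtil.handleSchemaEvolution(targetDsOri, srcDS, sparkSession)
     }
   }
   ```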

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {

Review comment:
       Can you assign sourceSchema.fields.map(_.name.toLowerCase) to a new variable and reuse it at line:516? A sketch follows below.
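   A minimal sketch of the reuse, with a hypothetical variable name:

   ```scala
   // compute the lowercased source field names once
   val srcFieldNamesLower = sourceSchema.fields.map(_.name.toLowerCase)

   // missing-field check reuses the array
   targetSchema.fields.foreach { tgtField =>
     if (!srcFieldNamesLower.contains(tgtField.name.toLowerCase)) {
       val errorMessage = s"source schema does not contain field: ${tgtField.name}"
       LOGGER.error(errorMessage)
       throw new CarbonSchemaException(errorMessage)
     }
   }

   // the duplicate-name check can then reuse the same array
   val similarFields = srcFieldNamesLower
     .groupBy(identity)
     .collect { case (name, occurrences) if occurrences.length > 1 => name }
     .toList
   ```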

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")

Review comment:
       Can you assign the exception message to a new variable and reuse it for both the log and the exception? Please handle this in all applicable places; see the sketch below.
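   For example, a sketch of the pattern, building the message once:

   ```scala
   // build the message once, log it, then throw with the same text
   val errorMessage = s"source schema does not contain field: ${tgtField.name}"
   LOGGER.error(errorMessage)
   throw new CarbonSchemaException(errorMessage)
   ```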

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||

Review comment:
       Can you move sourceSchema.fields.map(_.name.toLowerCase) outside the loop into a new variable and reuse it, as suggested in the earlier comment?

##########
File path: pom.xml
##########
@@ -130,7 +130,7 @@
     <scala.version>2.11.8</scala.version>
     <hadoop.deps.scope>compile</hadoop.deps.scope>
     <spark.version>2.3.4</spark.version>
-    <spark.binary.version>2.3</spark.binary.version>
+    <spark.binary.version>2.4</spark.binary.version>

Review comment:
       Why this change? Please revert.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    if (isPrimitiveAndNotDate) {
+      Comparator.getComparator(orderingFieldDataType)
+    } else if (orderingFieldDataType == DataTypes.STRING) {
+      null
+    } else {
+      Comparator.getComparatorByDataTypeForMeasure(orderingFieldDataType)
+    }
+  }
+
+  def getRowKey(
+      row: Row,
+      index: Integer,
+      carbonKeyColumn: CarbonColumn,
+      isPrimitiveAndNotDate: Boolean,
+      keyColumnDataType: CarbonDataType
+  ): AnyRef = {
+    if (!row.isNullAt(index)) {
+      row.getAs(index).toString
+    } else {
+      val value: Long = 0
+      if (carbonKeyColumn.isDimension) {
+        if (isPrimitiveAndNotDate) {
+          CarbonCommonConstants.EMPTY_BYTE_ARRAY
+        } else {
+          CarbonCommonConstants.MEMBER_DEFAULT_VAL_ARRAY
+        }
+      } else {
+        val nullValueForMeasure = if ((keyColumnDataType eq DataTypes.BOOLEAN) ||

Review comment:
       This chain of conditions can be replaced with a case match; a sketch follows below.
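   A sketch of the shape that refactor could take. The diff is cut off here, so the branch values below are assumed, not taken from the PR; `value` is the local `val value: Long = 0` defined just above:

   ```scala
   // sketch: replace the if/else chain with a case match over the data type
   val nullValueForMeasure: Any = keyColumnDataType match {
     case DataTypes.BOOLEAN => false
     case DataTypes.SHORT => value.toShort
     case DataTypes.INT => value.toInt
     case DataTypes.LONG | DataTypes.TIMESTAMP => value
     case DataTypes.DOUBLE => 0d
     case _ => value
   }
   ```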

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>
+      val index = row.fieldIndex(keyColumn)
+      val rowKey = getRowKey(row, index, carbonKeyColumn, isPrimitiveAndNotDate, keyColumnDataType)
+      (rowKey, row)
+    }.reduceByKey{(row1, row2) =>
+      val orderingValue1 = row1.getAs(orderingField).asInstanceOf[Any]
+      val orderingValue2 = row2.getAs(orderingField).asInstanceOf[Any]
+      if (orderingFieldDataType.equals(DataTypes.STRING)) {
+        if (orderingValue1 == null) {
+          row2
+        } else if (orderingValue2 == null) {
+          row1
+        } else {
+          if (ByteUtil.UnsafeComparer.INSTANCE
+                .compareTo(orderingValue1.toString
+                  .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET)),
+                  orderingValue2.toString
+                    .getBytes(Charset.forName(CarbonCommonConstants.DEFAULT_CHARSET))) >= 0) {
+            row1
+          } else {
+            row2
+          }
+        }
+      } else {
+        if (comparator.compare(orderingValue1, orderingValue2) >= 0) {
+          row1
+        } else {
+          row2
+        }
+      }
+    }.map(_._2)
+    sparkSession.createDataFrame(dedupedRDD, schema).alias(srcAlias)
+  }
+
+  def getComparator(
+      orderingFieldDataType: CarbonDataType
+  ): SerializableComparator = {
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&

Review comment:
       1. I think this method itself is not needed; Comparator.getComparator can be called directly, and ByteArraySerializableComparator can be used for the string type.
   2. Please refactor the caller method as well.
   3. For the DATE type this might throw IllegalArgumentException; please check and handle it. A sketch follows below.
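   A rough sketch of the inlined logic. The ByteArraySerializableComparator name is taken from this comment; its package and visibility are assumed, not verified:

   ```scala
   val comparator: SerializableComparator = orderingFieldDataType match {
     // removes the null special case the current code keeps for STRING
     case DataTypes.STRING => new ByteArraySerializableComparator()
     // Comparator.getComparator may reject DATE; since DATE is stored as int
     // internally, the int measure comparator is one assumed fallback
     case DataTypes.DATE => Comparator.getComparatorByDataTypeForMeasure(DataTypes.INT)
     case dt if DataTypeUtil.isPrimitiveColumn(dt) => Comparator.getComparator(dt)
     case dt => Comparator.getComparatorByDataTypeForMeasure(dt)
   }
   ```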

##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -1215,6 +1215,17 @@ private CarbonCommonConstants() {
 
   public static final String CARBON_ENABLE_BAD_RECORD_HANDLING_FOR_INSERT_DEFAULT = "false";
 
+  /**
+   * This flag decides if table schema needs to change as per the incoming batch schema.

Review comment:
       Can you move/group the CDC-related properties in one place?

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {

Review comment:
       This check is not needed. Can you move the sourceField variable before this check and replace the check with sourceField.isEmpty? A sketch follows below.
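   For example, a minimal sketch of the simplification:

   ```scala
   // find the matching source field first, then test for absence directly
   val sourceField = sourceSchema.fields
     .find(_.name.equalsIgnoreCase(tgtField.name))
   if (sourceField.isEmpty) {
     val errorMessage = s"source schema does not contain field: ${tgtField.name}"
     LOGGER.error(errorMessage)
     throw new CarbonSchemaException(errorMessage)
   }
   ```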

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {

Review comment:
       The same code is present in the verifySourceAndTargetSchemas method. Can move this code to a common method and reuse it.
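       For example, a rough sketch of the extracted helper (name and visibility are illustrative only, untested; assumes org.apache.spark.sql.types.StructType is imported):

    ```scala
    // Hypothetical common helper; both verifySourceAndTargetSchemas and
    // verifyBackwardsCompatibility could delegate the per-field checks to it.
    private def verifyFieldsExistWithSameType(
        sourceSchema: StructType,
        targetSchema: StructType): Unit = {
      targetSchema.fields.foreach { tgtField =>
        // the field must exist in the source schema (case-insensitive match)
        val sourceField = sourceSchema.fields.find(_.name.equalsIgnoreCase(tgtField.name))
        if (sourceField.isEmpty) {
          throw new CarbonSchemaException(
            s"source schema does not contain field: ${ tgtField.name }")
        }
        // and its data type must be unchanged
        if (!sourceField.get.dataType.equals(tgtField.dataType)) {
          throw new CarbonSchemaException(
            s"source schema has different data type for field: ${ tgtField.name }")
        }
      }
    }
    ```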

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -16,31 +16,41 @@
  */
 package org.apache.spark.sql.execution.command.mutation.merge
 
+import java.nio.charset.Charset
 import java.util
 
 import scala.collection.JavaConverters._
 import scala.collection.mutable
 
+import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{CarbonDatasourceHadoopRelation, Dataset, Row, SparkSession}
-import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.carbondata.execution.datasources.CarbonSparkDataSourceUtil
+import org.apache.spark.sql.catalyst.{CarbonParserUtil, TableIdentifier}
 import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
 import org.apache.spark.sql.catalyst.expressions.EqualTo
 import org.apache.spark.sql.execution.CastExpressionOptimization
+import org.apache.spark.sql.execution.command.{AlterTableAddColumnsModel, AlterTableDataTypeChangeModel, AlterTableDropColumnModel}
+import org.apache.spark.sql.execution.command.schema.{CarbonAlterTableAddColumnCommand, CarbonAlterTableColRenameDataTypeChangeCommand, CarbonAlterTableDropColumnCommand}
+import org.apache.spark.sql.functions.expr
 import org.apache.spark.sql.optimizer.CarbonFilters
-import org.apache.spark.sql.types.DateType
+import org.apache.spark.sql.parser.CarbonSpark2SqlParser
+import org.apache.spark.sql.types.{DateType, DecimalType, StructField}
 
+import org.apache.carbondata.common.exceptions.sql.{CarbonSchemaException, MalformedCarbonCommandException}

Review comment:
       please remove unused import - MalformedCarbonCommandException
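       i.e. the import line would then become:

    ```suggestion
    import org.apache.carbondata.common.exceptions.sql.CarbonSchemaException
    ```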

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")

Review comment:
       can move this to the previous line
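       e.g. joined into a single statement, something like:

    ```suggestion
        throw new CarbonSchemaException(s"source schema does not contain field: ${ tgtField.name }")
    ```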

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))

Review comment:
       tgtField.name.toLowerCase - converting to lowercase is not needed here, since the equalsIgnoreCase check already performs a case-insensitive comparison
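       e.g.:

    ```suggestion
        .find(f => f.name.equalsIgnoreCase(tgtField.name))
    ```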

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,

Review comment:
       The same code is available in DDLHelper.addColumns. Can move the common code to a new method and reuse it.
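       For example, a sketch of the shared builder (name and signature are placeholders, untested; assumes the imports this PR already adds):

    ```scala
    // Hypothetical helper that both DDLHelper.addColumns and
    // handleAddColumnScenario could call instead of duplicating this block.
    def buildAddColumnsModel(
        table: CarbonTable,
        colsToAdd: Seq[StructField]): AlterTableAddColumnsModel = {
      val dbName = CarbonParserUtil.convertDbNameToLowerCase(Option(table.getDatabaseName))
      val tableName = table.getTableName.toLowerCase
      val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
      // reuse the parser utilities to build the table model for the alter flow
      val tableModel = CarbonParserUtil.prepareTableModel(
        ifNotExistPresent = false,
        dbName,
        tableName,
        fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
        Seq.empty,
        scala.collection.mutable.Map.empty[String, String],
        None,
        isAlterFlow = true)
      AlterTableAddColumnsModel(
        dbName,
        tableName,
        Map.empty[String, String],
        tableModel.dimCols,
        tableModel.msrCols,
        tableModel.highCardinalityDims.getOrElse(Seq.empty))
    }
    ```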

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)

Review comment:
       please add braces and format the code here
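       e.g.:

    ```suggestion
          val srcDsWithoutMeta = if (metaCols.length > 0) {
            srcDs.drop(metaCols: _*)
          } else {
            srcDs
          }
    ```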

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,

Review comment:
    ```suggestion
            sourceSchema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
    ```
    i.e. use the already-computed `sourceSchema` instead of `srcDs.schema`; after this, the call can be formatted onto a single line.
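       e.g. something like:

    ```suggestion
          handleAddColumnScenario(targetDs,
            sourceSchema.fields.filter(f => addedColumns.contains(f.name)).toSeq, sparkSession)
    ```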

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()
+    dedupedDataset
+  }
+
+  def deduplicateAgainstIncomingDataset(
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      table: CarbonTable): Dataset[Row] = {
+    if (orderingField.equals(CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD_DEFAULT)) {
+      return srcDs
+    }
+    val schema = srcDs.schema
+    val carbonKeyColumn = table.getColumnByName(keyColumn)
+    val keyColumnDataType = getCarbonDataType(keyColumn, srcDs)
+    val orderingFieldDataType = getCarbonDataType(orderingField, srcDs)
+    val isPrimitiveAndNotDate = DataTypeUtil.isPrimitiveColumn(orderingFieldDataType) &&
+                                (orderingFieldDataType != DataTypes.DATE)
+    val comparator = getComparator(orderingFieldDataType)
+    val rdd = srcDs.rdd
+    val dedupedRDD: RDD[Row] = rdd.map{row =>

Review comment:
       please format this code
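       e.g. with spaces around the closure braces:

    ```suggestion
        val dedupedRDD: RDD[Row] = rdd.map { row =>
    ```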

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {

Review comment:
       The same code is available in DDLHelper.changeColumn. Can move the common code to a new method and reuse it.
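       For example, a sketch of the shared builder (name and signature are placeholders, untested):

    ```scala
    // Hypothetical helper that both DDLHelper.changeColumn and
    // handleDataTypeChangeScenario could call for each modified column.
    def buildDataTypeChangeModel(
        table: CarbonTable,
        col: StructField): AlterTableDataTypeChangeModel = {
      // decimals carry precision/scale; other types need no extra values
      val values = col.dataType match {
        case d: DecimalType => Some(List((d.precision, d.scale)))
        case _ => None
      }
      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
      AlterTableDataTypeChangeModel(
        dataTypeInfo,
        Option(table.getDatabaseName.toLowerCase),
        table.getTableName.toLowerCase,
        col.name.toLowerCase,
        col.name.toLowerCase,
        isColumnRename = false,
        Option.empty)
    }
    ```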
   

##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala
##########
@@ -847,20 +820,222 @@ class MergeTestCase extends QueryTest with BeforeAndAfterAll {
         Row("j", 2, "RUSSIA"), Row("k", 0, "INDIA")))
   }
 
-  test("test all the merge APIs UPDATE, DELETE, UPSERT and INSERT") {
+  def prepareTarget(
+      isPartitioned: Boolean = false,
+      partitionedColumn: String = null
+  ): Dataset[Row] = {
     sql("drop table if exists target")
-    val initframe = sqlContext.sparkSession.createDataFrame(Seq(
+    val initFrame = sqlContext.sparkSession.createDataFrame(Seq(
       Row("a", "0"),
       Row("b", "1"),
       Row("c", "2"),
       Row("d", "3")
     ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
-    initframe.write
-      .format("carbondata")
-      .option("tableName", "target")
-      .mode(SaveMode.Overwrite)
-      .save()
-    val target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+
+    if (isPartitioned) {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .option("partitionColumns", partitionedColumn)
+        .mode(SaveMode.Overwrite)
+        .save()
+    } else {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .mode(SaveMode.Overwrite)
+        .save()
+    }
+    sqlContext.read.format("carbondata").option("tableName", "target").load()
+  }
+
+  def prepareTargetWithThreeFields(
+      isPartitioned: Boolean = false,
+      partitionedColumn: String = null
+  ): Dataset[Row] = {
+    sql("drop table if exists target")
+    val initFrame = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 0, "CHINA"),
+      Row("b", 1, "INDIA"),
+      Row("c", 2, "INDIA"),
+      Row("d", 3, "US")
+    ).asJava,
+      StructType(Seq(StructField("key", StringType),
+        StructField("value", IntegerType),
+        StructField("country", StringType))))
+
+    if (isPartitioned) {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .option("partitionColumns", partitionedColumn)
+        .mode(SaveMode.Overwrite)
+        .save()
+    } else {
+      initFrame.write
+        .format("carbondata")
+        .option("tableName", "target")
+        .mode(SaveMode.Overwrite)
+        .save()
+    }
+    sqlContext.read.format("carbondata").option("tableName", "target").load()
+  }
+
+  test("test schema enforcement") {
+    val target = prepareTarget()
+    var cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", "1", "ab"),
+      Row("d", "4", "de")
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", StringType)
+      , StructField("new_value", StringType))))
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT, "true"
+    )
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", "1"), Row("b", "1"), Row("c", "2"), Row("d", "4")))
+
+    properties.addProperty(
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "true"
+    )
+
+    val exceptionCaught1 = intercept[MalformedCarbonCommandException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1, "ab"),
+        Row("d", 4, "de")
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", IntegerType)
+        , StructField("new_value", StringType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+    assert(exceptionCaught1.getMessage
+      .contains(
+        "property CARBON_STREAMER_INSERT_DEDUPLICATE should " +
+        "only be set with operation type INSERT"))
+
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    val exceptionCaught2 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1),
+        Row("d", 4)
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("val", IntegerType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+    assert(exceptionCaught2.getMessage.contains("source schema does not contain field: value"))
+
+    val exceptionCaught3 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", 1),
+        Row("d", 4)
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", LongType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+
+    assert(exceptionCaught3.getMsg.contains("source schema has different " +
+                                            "data type for field: value"))
+
+    val exceptionCaught4 = intercept[CarbonSchemaException] {
+      cdc = sqlContext.sparkSession.createDataFrame(Seq(
+        Row("a", "1", "A"),
+        Row("d", "4", "D")
+      ).asJava, StructType(Seq(StructField("key", StringType),
+        StructField("value", StringType), StructField("Key", StringType))))
+      target.as("A").upsert(cdc.as("B"), "key").execute()
+    }
+
+    assert(exceptionCaught4.getMsg.contains("source schema has similar fields which " +
+                                            "differ only in case sensitivity: key"))
+  }
+
+  test("test schema evolution") {
+    val properties = CarbonProperties.getInstance()
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT, "false"
+    )
+    properties.addProperty(
+      CarbonCommonConstants.CARBON_STREAMER_SOURCE_ORDERING_FIELD, "value"
+    )
+    sql("drop table if exists target")
+    var target = prepareTargetWithThreeFields()
+    var cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 1, "ab", "china"),
+      Row("d", 4, "de", "china"),
+      Row("d", 7, "updated_de", "china_pro")
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", IntegerType)
+      , StructField("new_value", StringType),
+      StructField("country", StringType))))
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", 1, "china", "ab"), Row("b", 1, "INDIA", null),
+        Row("c", 2, "INDIA", null), Row("d", 7, "china_pro", "updated_de")))
+
+    target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+
+    cdc = sqlContext.sparkSession.createDataFrame(Seq(
+      Row("a", 5),
+      Row("d", 5)
+    ).asJava, StructType(Seq(StructField("key", StringType),
+      StructField("value", IntegerType))))
+    target.as("A").upsert(cdc.as("B"), "key").execute()
+    checkAnswer(sql("select * from target"),
+      Seq(Row("a", 5), Row("b", 1),
+        Row("c", 2), Row("d", 5)))
+
+//    target = sqlContext.read.format("carbondata").option("tableName", "target").load()
+//    cdc = sqlContext.sparkSession.createDataFrame(Seq(
+//      Row("b", 50),
+//      Row("d", 50)

Review comment:
       Please remove this commented-out code if it is not needed.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,

Review comment:
       Please remove this commented-out code.

##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))
+    if (addedColumns.nonEmpty) {
+      handleAddColumnScenario(targetDs,
+        srcDs.schema.fields.filter(f => addedColumns.contains(f.name)).toSeq,
+        sparkSession)
+    }
+
+    // check if any column got deleted from source
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val partitionInfo = targetCarbonTable.getPartitionInfo
+    val partitionColumns = if (partitionInfo != null) partitionInfo.getColumnSchemaList.asScala
+      .map(_.getColumnName).toList else List[String]()
+    val deletedColumns = targetSchema.fields.map(_.name.toLowerCase)
+      .filterNot(f => {
+        sourceSchema.fields.map(_.name.toLowerCase).contains(f) ||
+        partitionColumns.contains(f)
+      })
+    if (deletedColumns.nonEmpty) {
+      handleDeleteColumnScenario(targetDs, deletedColumns.toList, sparkSession)
+    }
+
+    val modifiedColumns = targetSchema.fields.filter(tgtField => {
+      val sourceField = sourceSchema.fields.find(f => f.name.equalsIgnoreCase(tgtField.name))
+      if (sourceField.isDefined) !sourceField.get.dataType.equals(tgtField.dataType) else false
+    })
+
+    if (modifiedColumns.nonEmpty) {
+      handleDataTypeChangeScenario(targetDs, modifiedColumns.toList, sparkSession)
+    }
+  }
+
+  /**
+   * This method calls CarbonAlterTableAddColumnCommand for adding new columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToAdd new columns to be added
+   * @param sparkSession SparkSession
+   */
+  def handleAddColumnScenario(targetDs: Dataset[Row], colsToAdd: Seq[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val fields = new CarbonSpark2SqlParser().getFields(colsToAdd)
+    val tableModel = CarbonParserUtil.prepareTableModel(ifNotExistPresent = false,
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      fields.map(CarbonParserUtil.convertFieldNamesToLowercase),
+      Seq.empty,
+      scala.collection.mutable.Map.empty[String, String],
+      None,
+      isAlterFlow = true)
+    //    targetCarbonTable.getAllDimensions.asScala.map(f => Field(column = f.getColName,
+    //      dataType = Some(f.getDataType.getName), name = Option(f.getColName),
+    //      children = None, ))
+    val alterTableAddColumnsModel = AlterTableAddColumnsModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      Map.empty[String, String],
+      tableModel.dimCols,
+      tableModel.msrCols,
+      tableModel.highCardinalityDims.getOrElse(Seq.empty))
+    CarbonAlterTableAddColumnCommand(alterTableAddColumnsModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableDropColumnCommand for deleting columns
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param colsToDrop columns to be dropped from carbondata table
+   * @param sparkSession SparkSession
+   */
+  def handleDeleteColumnScenario(targetDs: Dataset[Row], colsToDrop: List[String],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+    val alterTableDropColumnModel = AlterTableDropColumnModel(
+      CarbonParserUtil.convertDbNameToLowerCase(Option(targetCarbonTable.getDatabaseName)),
+      targetCarbonTable.getTableName.toLowerCase,
+      colsToDrop.map(_.toLowerCase))
+    CarbonAlterTableDropColumnCommand(alterTableDropColumnModel).run(sparkSession)
+  }
+
+  /**
+   * This method calls CarbonAlterTableColRenameDataTypeChangeCommand for handling data type changes
+   * @param targetDs target dataset whose schema needs to be modified
+   * @param modifiedCols columns with data type changes
+   * @param sparkSession SparkSession
+   */
+  def handleDataTypeChangeScenario(targetDs: Dataset[Row], modifiedCols: List[StructField],
+      sparkSession: SparkSession): Unit = {
+    val relations = CarbonSparkUtil.collectCarbonRelation(targetDs.logicalPlan)
+    val targetCarbonTable = relations.head.carbonRelation.carbonTable
+
+    // need to call the command one by one for each modified column
+    modifiedCols.foreach(col => {
+      val values = col.dataType match {
+        case d: DecimalType => Some(List((d.precision, d.scale)))
+        case _ => None
+      }
+      val dataTypeInfo = CarbonParserUtil.parseColumn(col.name, col.dataType, values)
+
+      val alterTableColRenameAndDataTypeChangeModel =
+        AlterTableDataTypeChangeModel(
+          dataTypeInfo,
+          Option(targetCarbonTable.getDatabaseName.toLowerCase),
+          targetCarbonTable.getTableName.toLowerCase,
+          col.name.toLowerCase,
+          col.name.toLowerCase,
+          isColumnRename = false,
+          Option.empty)
+
+      CarbonAlterTableColRenameDataTypeChangeCommand(
+        alterTableColRenameAndDataTypeChangeModel
+      ).run(sparkSession)
+    })
+  }
+
+  def deduplicateBeforeWriting(
+      srcDs: Dataset[Row],
+      targetDs: Dataset[Row],
+      sparkSession: SparkSession,
+      srcAlias: String,
+      targetAlias: String,
+      keyColumn: String,
+      orderingField: String,
+      targetCarbonTable: CarbonTable): Dataset[Row] = {
+    val properties = CarbonProperties.getInstance()
+    val filterDupes = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_INSERT_DEDUPLICATE_DEFAULT).toBoolean
+    val combineBeforeUpsert = properties
+      .getProperty(CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE,
+        CarbonCommonConstants.CARBON_STREAMER_UPSERT_DEDUPLICATE_DEFAULT).toBoolean
+    var dedupedDataset: Dataset[Row] = srcDs
+    if (combineBeforeUpsert) {
+      dedupedDataset = deduplicateAgainstIncomingDataset(srcDs, sparkSession, srcAlias, keyColumn,
+        orderingField, targetCarbonTable)
+    }
+    if (filterDupes) {
+      dedupedDataset = deduplicateAgainstExistingDataset(dedupedDataset, targetDs,
+        srcAlias, targetAlias, keyColumn)
+    }
+    dedupedDataset.show()

Review comment:
       Please remove this debug line (a guarded alternative is sketched below).
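       If the preview is still useful while debugging, a guarded variant avoids the
       unconditional Spark job (a sketch; LOGGER is the existing class logger, and
       log4j's isDebugEnabled check is assumed to be available on it):

       // show() is an action and triggers an extra Spark job; only run it when debugging
       if (LOGGER.isDebugEnabled) {
         dedupedDataset.show()
       }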




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-931379510


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4227/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] akashrn5 commented on a change in pull request #4227: [CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added

Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#discussion_r728168954



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
##########
@@ -462,4 +474,413 @@ object CarbonMergeDataSetUtil {
       columnMinMaxInBlocklet.asScala
     }
   }
+
+  /**
+   * This method verifies source and target schemas for the following:
+   * If additional columns are present in source schema as compared to target, simply ignore them.
+   * If some columns are missing in source schema as compared to target schema, exception is thrown.
+   * If data type of some column differs in source and target schemas, exception is thrown.
+   * If source schema has multiple columns whose names differ only in case sensitivity, exception
+   * is thrown.
+   * @param targetDs target carbondata table
+   * @param srcDs source/incoming data
+   */
+  def verifySourceAndTargetSchemas(targetDs: Dataset[Row], srcDs: Dataset[Row]): Unit = {
+    LOGGER.info("schema enforcement is enabled. Source and target schemas will be verified")
+    // get the source and target dataset schema
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+
+    // check if some additional column got added in source schema
+    if (sourceSchema.fields.length > targetSchema.fields.length) {
+      val additionalSourceFields = sourceSchema.fields.map(_.name.toLowerCase)
+        .filterNot(srcField => {
+          targetSchema.fields.map(_.name.toLowerCase).contains(srcField)
+        })
+      LOGGER.warn(s"source schema contains additional fields which are not present in " +
+                  s"target schema: ${ additionalSourceFields.mkString(",") }")
+    }
+
+    // check if source schema has fields whose names only differ in case sensitivity
+    val similarFields = sourceSchema.fields.map(_.name.toLowerCase).groupBy(a => identity(a)).map {
+      case (str, times) => (str, times.length)
+    }.toList.filter(e => e._2 > 1).map(_._1)
+    if (similarFields.nonEmpty) {
+      LOGGER.error(s"source schema has similar fields which differ only in case sensitivity: " +
+                   s"${ similarFields.mkString(",") }")
+      throw new CarbonSchemaException(s"source schema has similar fields which differ" +
+                                                s" only in case sensitivity: ${
+                                                  similarFields.mkString(",")
+                                                }")
+    }
+  }
+
+  /**
+   * This method takes care of handling schema evolution scenarios for CarbonStreamer class.
+   * Currently only addition of columns is supported.
+   * @param targetDs target dataset whose schema needs to be modified, if applicable
+   * @param srcDs incoming dataset
+   * @param sparkSession SparkSession
+   */
+  def handleSchemaEvolutionForCarbonStreamer(targetDs: Dataset[Row], srcDs: Dataset[Row],
+      sparkSession: SparkSession): Unit = {
+    // read the property here
+    val isSchemaEnforcementEnabled = CarbonProperties.getInstance()
+      .getProperty(CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT,
+        CarbonCommonConstants.CARBON_ENABLE_SCHEMA_ENFORCEMENT_DEFAULT).toBoolean
+    if (isSchemaEnforcementEnabled) {
+      verifySourceAndTargetSchemas(targetDs, srcDs)
+    } else {
+      // These meta columns should be removed before actually writing the data
+      val metaColumnsString = CarbonProperties.getInstance()
+        .getProperty(CarbonCommonConstants.CARBON_STREAMER_META_COLUMNS, "")
+      val metaCols = metaColumnsString.split(",").map(_.trim)
+      val srcDsWithoutMeta = if (metaCols.length > 0) srcDs.drop(metaCols: _*)
+      else srcDs
+      handleSchemaEvolution(targetDs, srcDsWithoutMeta, sparkSession, isStreamerInvolved = true)
+    }
+  }
+
+  def verifyBackwardsCompatibility(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row]): Unit = {
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    targetSchema.fields.foreach(tgtField => {
+      // check if some field is missing in source schema
+      if (!sourceSchema.fields.map(_.name.toLowerCase).contains(tgtField.name.toLowerCase)) {
+        LOGGER.error(s"source schema does not contain field: ${ tgtField.name }")
+        throw new CarbonSchemaException(s"source schema does not contain " +
+                                                  s"field: ${ tgtField.name }")
+      }
+
+      // check if data type got modified for some column
+      val sourceField = sourceSchema.fields
+        .find(f => f.name.equalsIgnoreCase(tgtField.name.toLowerCase))
+      if (!sourceField.get.dataType.equals(tgtField.dataType)) {
+        LOGGER.error(s"source schema has different data type for field: ${
+          tgtField.name
+        }, source type: ${ sourceField.get.dataType }, target type: ${ tgtField.dataType }")
+        throw new CarbonSchemaException(s"source schema has different data type " +
+                                                  s"for field: ${ tgtField.name }")
+      }
+    })
+  }
+
+  /**
+   * The method takes care of following schema evolution cases:
+   * Addition of a new column in source schema which is not present in target
+   * Deletion of a column in source schema which is present in target
+   * Data type changes for an existing column.
+   * The method does not take care of column renames and table renames
+   * @param targetDs existing target dataset
+   * @param srcDs incoming source dataset
+   * @return new target schema to write the incoming batch with
+   */
+  def handleSchemaEvolution(
+      targetDs: Dataset[Row],
+      srcDs: Dataset[Row],
+      sparkSession: SparkSession,
+      isStreamerInvolved: Boolean = false): Unit = {
+
+    if (isStreamerInvolved) {
+      verifyBackwardsCompatibility(targetDs, srcDs)
+    }
+    val sourceSchema = srcDs.schema
+    val targetSchema = targetDs.schema
+
+    // check if any column got added in source
+    val addedColumns = sourceSchema.fields
+      .map(_.name)
+      .filterNot(f => targetSchema.fields.map(_.name).contains(f))

Review comment:
       I think we can replace lines 614 to 616 with the code below:

       val addedColumns = sourceSchema.fields
         .filterNot(field => targetSchema.fields.map(_.name).contains(field.name))
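       A small variation on the same idea: building the target-name lookup once as a
       Set avoids re-mapping the target fields for every source field (a sketch using
       the schemas already in scope):

       // one pass over the target schema instead of one pass per source field
       val targetFieldNames = targetSchema.fields.map(_.name).toSet
       val addedColumns = sourceSchema.fields
         .filterNot(field => targetFieldNames.contains(field.name))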




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4227: [WIP]: schema evolution test cases w/o data type change working

Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4227:
URL: https://github.com/apache/carbondata/pull/4227#issuecomment-932152805


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4239/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org