Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/24 11:34:44 UTC

[GitHub] [hudi] danielfordfc opened a new issue, #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

danielfordfc opened a new issue, #7049:
URL: https://github.com/apache/hudi/issues/7049

   We are using the Hudi 0.11.0 DeltaStreamer on emr-6.7.0 to read data from our Confluent Kafka cluster (with Schema Registry) and write it to a Glue Catalog table to be queried through Athena.
   
   Our spark-submit command is as follows:
   
   ```
   "spark-submit",
                   "--jars",
                   "/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar",
                   "--class",
                   "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
                   "--conf",
                   "spark.serializer=org.apache.spark.serializer.KryoSerializer",
                   "--conf",
                   "spark.sql.catalogImplementation=hive",
                   "--conf",
                   "spark.sql.hive.convertMetastoreParquet=false",
                   "--conf",
                   "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
                   "--conf",
                   "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
                   "--conf",
            "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
                   "/usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.0-amzn-0.jar",
                   "--props",
                   f"s3://{bucket}/configs/{source}.properties",
                   "--source-class",
                   "org.apache.hudi.utilities.sources.AvroKafkaSource",
                   "--target-base-path",
                   f"s3://{bucket}/{source}/raw",
                   "--target-table",
                   source,
                   "--schemaprovider-class",
                   "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
                   "--transformer-class",
                   "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
                   "--source-ordering-field",
                   "published_at",
                   #"--enable-sync",
   ```
   
   Our Hudi properties file, for which we've been trying to find the correct configuration, is as follows:
   
   ```
   #Example NonPartitionedGenerator config 
   hoodie.datasource.hive_sync.database=pandas_raw
   hoodie.database.name=pandas_raw
   hoodie.table.name=funding_channel_failed_allocation
   hoodie.datasource.hive_sync.table=funding_channel_failed_allocation
   hoodie.deltastreamer.transformer.sql=select id, published_at FROM <SRC>
   hoodie.datasource.write.precombine.field=published_at
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   # hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   # hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.write.partitionpath.field=''
   hoodie.datasource.write.table.type=COPY_ON_WRITE
   hoodie.datasource.write.operation=UPSERT
   # hoodie.datasource.hive_sync.enable=true
   # hoodie.datasource.hive_sync.mode=hms
   # hoodie.datasource.hive_sync.sync_as_datasource=false
   # hoodie.datasource.hive_sync.use_jdbc=false
   # hoodie.datasource.hive_sync.use_pre_apache_input_format=true
   hoodie.avro.schema.validate=false
   hoodie.schema.on.read.enable=true
   ```
   
   We've been commenting and uncommenting the above fields to try and find a combination that works.
   
   Our issue is that the `hoodie.deltastreamer.transformer.sql` statement doesn't appear to have any effect on the output parquet. The parquet files, when opened, do not contain the added column, and the Glue Catalog doesn't show it either.
   
   I'm unsure whether some of these settings conflict with each other.
   
   With the above configuration, the EMR job step fails with:
   `Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Unable to validate the rewritten record {"id": "{REDACTED_UUID}", "published_at": 1464267354683} against schema {OUR SOURCE SCHEMA FETCHED FROM OUR REGISTRY}`
   
   So it is TRYING to do something, and the error is likely due to a non-backwards-compatible schema change. If we instead make a backwards-compatible change like
   
   `hoodie.deltastreamer.transformer.sql=select *, '1' AS test_field FROM <SRC>`
   
   there are no errors, but nothing happens. The table and parquet files don't contain the new `test_field` column.
   
   This is confusing, considering that we have set:
   
   `hoodie.avro.schema.validate=false`
   
   Hoping there's an expert out there on this issue? Ideally we'd like to be able to derive a field using the SQL transformer that we then use for the partitioning strategy, but I'd like to just see the transformer working for starters!
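
A note on the `<SRC>` placeholder used in the queries above, for readers unfamiliar with it. As far as I understand the transformer, it exposes the incoming batch as a temporary view and substitutes that view's name for `<SRC>` before running the query. The sketch below models only the substitution step in plain Python; `resolve_transformer_sql` and the `src_tmp_` prefix are hypothetical names for illustration, not Hudi's API.

```python
import uuid

# The literal token that the transformer SQL uses to refer to the incoming batch.
SRC_PLACEHOLDER = "<SRC>"


def resolve_transformer_sql(sql, view_name=None):
    """Return `sql` with every <SRC> placeholder replaced by a view name.

    If no view name is given, a unique one is generated (hypothetical
    naming scheme; the real transformer chooses its own).
    """
    if view_name is None:
        view_name = "src_tmp_" + uuid.uuid4().hex
    return sql.replace(SRC_PLACEHOLDER, view_name)


if __name__ == "__main__":
    print(resolve_transformer_sql("select *, '1' AS test_field FROM <SRC>"))
```

The point of the sketch is that the query itself runs against the full incoming batch; whether its output columns survive to the parquet files is decided in a later step.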
   
   
   **Environment Description**
   Using Hudi deltastreamer on EMR-6.7.0 on AWS.
   
   * Hudi version : 0.11.0
   * Spark version : 3.2.1
   * Hive version : 3.1.3
   * Hadoop version : Amazon 3.2.1
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : N
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] xushiyan commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

Posted by GitBox <gi...@apache.org>.

xushiyan commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1293301757

   @danielfordfc to confirm: is this Hudi 0.11.0 the one that comes pre-installed on EMR 6.7?



[GitHub] [hudi] nsivabalan commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1302871037

   thanks. 



[GitHub] [hudi] danielfordfc commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

Posted by GitBox <gi...@apache.org>.

danielfordfc commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1297247555

   Thank you very much! I'll give this a go and report back 🖖.
   
   For what it's worth, we solved the further use case that I mention in this ticket without using the SQL transform, so this is no longer urgent for us.



[GitHub] [hudi] danielfordfc commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

Posted by GitBox <gi...@apache.org>.

danielfordfc commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1293303211

   Yep! https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-670-release.html



[GitHub] [hudi] nsivabalan commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1295692772

   The issue could be that the schema provider is returning the original schema without the `test_field` that you are adding as part of the transformer.
   If you can't make the provider's schema match the output schema after transformation, then it's better not to set any schema provider; Hudi will then infer the schema from the Dataset<Row> (after transformation).
   
   I tried this locally and the following query worked for me.
   ```
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT *, '1' AS test_field FROM <SRC> a \""
   ```
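
The explanation above can be illustrated with a simplified model. This is a sketch in plain Python, not Hudi's actual implementation: `rewrite_record`, `REQUIRED`, `SchemaMismatch` and the `amount` field are hypothetical names introduced for illustration (the real source schema is redacted in this thread). It assumes that each transformed record is projected onto the schema the provider returns, so fields the schema doesn't list are dropped and missing fields without a default raise an error.

```python
# Simplified model (NOT Hudi's code) of rewriting a transformed record
# against the schema returned by a schema provider.

REQUIRED = object()  # sentinel: the field has no default value


class SchemaMismatch(Exception):
    """Raised when a record lacks a field the target schema requires."""


def rewrite_record(record, target_fields):
    """Project `record` onto `target_fields`.

    `target_fields` maps field name -> default value, or REQUIRED when the
    field has no default. Fields present in `record` but absent from
    `target_fields` are silently dropped.
    """
    rewritten = {}
    for name, default in target_fields.items():
        if name in record:
            rewritten[name] = record[name]
        elif default is not REQUIRED:
            rewritten[name] = default
        else:
            raise SchemaMismatch(f"missing required field: {name}")
    return rewritten


if __name__ == "__main__":
    # Hypothetical stand-in for the schema fetched from the registry.
    source_schema = {"id": REQUIRED, "published_at": REQUIRED, "amount": REQUIRED}

    # Like "SELECT *, '1' AS test_field": no error, but the new field is lost.
    widened = {"id": "a", "published_at": 1, "amount": 2, "test_field": "1"}
    print(rewrite_record(widened, source_schema))

    # Like "select id, published_at": a required field is missing.
    try:
        rewrite_record({"id": "a", "published_at": 1}, source_schema)
    except SchemaMismatch as err:
        print("rejected:", err)
```

Under this model, both symptoms reported in the issue come from the same step: the `test_field` query runs cleanly but loses its new column, and the `id, published_at` query fails because the provider's schema still expects the other fields.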
   
   



[GitHub] [hudi] nsivabalan closed issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data

Posted by GitBox <gi...@apache.org>.

nsivabalan closed issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
URL: https://github.com/apache/hudi/issues/7049

