Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/24 11:34:44 UTC
[GitHub] [hudi] danielfordfc opened a new issue, #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
danielfordfc opened a new issue, #7049:
URL: https://github.com/apache/hudi/issues/7049
We are using the Hudi 0.11.0 DeltaStreamer on emr-6.7.0 to read data from our Confluent Kafka cluster (with Schema Registry) and write it to a Glue Catalog table to be queried through Athena.
Our spark-submit command is as follows:
```
"spark-submit",
"--jars",
"/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar",
"--class",
"org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
"--conf",
"spark.serializer=org.apache.spark.serializer.KryoSerializer",
"--conf",
"spark.sql.catalogImplementation=hive",
"--conf",
"spark.sql.hive.convertMetastoreParquet=false",
"--conf",
"spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
"--conf",
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"--conf",
"spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
"/usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.0-amzn-0.jar",
"--props",
f"s3://{bucket}/configs/{source}.properties",
"--source-class",
"org.apache.hudi.utilities.sources.AvroKafkaSource",
"--target-base-path",
f"s3://{bucket}/{source}/raw",
"--target-table",
source,
"--schemaprovider-class",
"org.apache.hudi.utilities.schema.SchemaRegistryProvider",
"--transformer-class",
"org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
"--source-ordering-field",
"published_at",
#"--enable-sync",
```
Our Hudi properties file, for which we've been trying to find the correct configuration, is as follows:
```
#Example NonPartitionedGenerator config
hoodie.datasource.hive_sync.database=pandas_raw
hoodie.database.name=pandas_raw
hoodie.table.name=funding_channel_failed_allocation
hoodie.datasource.hive_sync.table=funding_channel_failed_allocation
hoodie.deltastreamer.transformer.sql=select id, published_at FROM <SRC>
hoodie.datasource.write.precombine.field=published_at
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
# hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
# hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.partitionpath.field=''
hoodie.datasource.write.table.type=COPY_ON_WRITE
hoodie.datasource.write.operation=UPSERT
# hoodie.datasource.hive_sync.enable=true
# hoodie.datasource.hive_sync.mode=hms
# hoodie.datasource.hive_sync.sync_as_datasource=false
# hoodie.datasource.hive_sync.use_jdbc=false
# hoodie.datasource.hive_sync.use_pre_apache_input_format=true
hoodie.avro.schema.validate=false
hoodie.schema.on.read.enable=true
```
We've been commenting and uncommenting the above fields to try and find a combination that works.
Our issue is that the `hoodie.deltastreamer.transformer.sql` statement doesn't appear to have any effect on the output parquet. The parquet files, when opened, do not contain the added column, and the Glue Catalog doesn't show it either.
I'm unsure whether some of these settings conflict with each other.
In the above configuration, the EMR job step complains about:
`Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Unable to validate the rewritten record {"id": "{REDACTED_UUID}", "published_at": 1464267354683} against schema {OUR SOURCE SCHEMA FETCHED FROM OUR REGISTRY}`
so it is TRYING to do something, and the failure is likely due to a non-backwards-compatible schema change. If we try to make a backwards-compatible change like
`hoodie.deltastreamer.transformer.sql=select *, '1' AS test_field FROM <SRC>`
there are no errors, but nothing happens. The table and parquet files don't contain the new field.
This is confusing, considering that we have set:
`hoodie.avro.schema.validate=false`
Hoping there's an expert out there on this issue? Ideally we'd like to be able to derive a field using the SQL transformer that we then use for the partitioning strategy, but I'd like to just see the transformer working for starters!
**Environment Description**
Using Hudi deltastreamer on EMR-6.7.0 on AWS.
* Hudi version : 0.11.0
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : Amazon 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : N
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1293301757
@danielfordfc to confirm: the hudi 0.11.0 is the one pre-installed on EMR 6.7?
[GitHub] [hudi] nsivabalan commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1302871037
thanks.
[GitHub] [hudi] danielfordfc commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
Posted by GitBox <gi...@apache.org>.
danielfordfc commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1297247555
Thank you very much! I'll give this a go and report back 🖖 .
For what it's worth, we solved the further use case mentioned in this ticket without using the SQL transform, so this is no longer urgent for us.
[GitHub] [hudi] danielfordfc commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
Posted by GitBox <gi...@apache.org>.
danielfordfc commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1293303211
Yep! https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-670-release.html
[GitHub] [hudi] nsivabalan commented on issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #7049:
URL: https://github.com/apache/hudi/issues/7049#issuecomment-1295692772
The issue could be that the schema provider is returning the original schema without the `test_field` that you are adding as part of the transformer.
If you can't match the output schema after transformation, then it's better not to set any schema provider; Hudi will then infer the schema from the Dataset<Row> (after transformation).
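To make the suspected mismatch concrete, here is a small sketch in plain Python. It is only an illustration of comparing field names, not Hudi's actual validation code. The helper `compare_fields` and the `amount` field are hypothetical (the real source schema was redacted in the issue); `id`, `published_at` and `test_field` come from the thread.

```python
def compare_fields(transformed_fields, provider_fields):
    """Compare transformer output fields with the schema provider's fields.

    Returns (extra, missing):
      extra   - fields the transformer produced that the provider schema lacks
      missing - fields the provider schema expects that the output lacks
    """
    transformed = set(transformed_fields)
    provider = set(provider_fields)
    return sorted(transformed - provider), sorted(provider - transformed)


# Hypothetical stand-in for the registry schema; `amount` is an assumed field.
provider_fields = ["id", "published_at", "amount"]

# Output of: SELECT *, '1' AS test_field FROM <SRC>
added_column = provider_fields + ["test_field"]
# Output of: SELECT id, published_at FROM <SRC>
projection = ["id", "published_at"]

print(compare_fields(added_column, provider_fields))
print(compare_fields(projection, provider_fields))
```

In the first case the output carries a field the provider schema does not know about, which would fit the "no errors, but no new column" symptom. In the second, the output lacks fields the provider schema expects, which would fit the `SchemaCompatibilityException`.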
I tried locally and this query worked out for me.
```
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT *, '1' AS test_field FROM <SRC> a \""
```
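For a job submitted as a Python argument list, as in the original report, the same idea might look like the sketch below. The function `build_deltastreamer_args` and the bucket name are hypothetical, and the `--jars` and `--conf` options from the original command are left out for brevity. Making the schema provider optional is an assumption for illustration: whether `AvroKafkaSource` runs without any schema provider on this Hudi version is not confirmed in this thread.

```python
def build_deltastreamer_args(bucket, source, transformer_sql,
                             schema_provider_class=None):
    """Build a spark-submit argument list for HoodieDeltaStreamer.

    The transformer SQL is passed with --hoodie-conf instead of the
    properties file. --schemaprovider-class is only added when one is given.
    """
    args = [
        "spark-submit",
        "--class",
        "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "/usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.0-amzn-0.jar",
        "--props",
        f"s3://{bucket}/configs/{source}.properties",
        "--source-class",
        "org.apache.hudi.utilities.sources.AvroKafkaSource",
        "--target-base-path",
        f"s3://{bucket}/{source}/raw",
        "--target-table",
        source,
        "--transformer-class",
        "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
        "--hoodie-conf",
        f"hoodie.deltastreamer.transformer.sql={transformer_sql}",
        "--source-ordering-field",
        "published_at",
    ]
    if schema_provider_class is not None:
        # Optional: only set a provider whose schema matches the
        # transformed output.
        args += ["--schemaprovider-class", schema_provider_class]
    return args


# "my-bucket" is a placeholder, not a value from the issue.
args = build_deltastreamer_args(
    "my-bucket",
    "funding_channel_failed_allocation",
    "SELECT *, '1' AS test_field FROM <SRC>",
)
print(args)
```

If the list is handed to the process as separate arguments rather than through a shell, the escaped quotes used in the command above should not be needed. That depends on how the step is submitted.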
[GitHub] [hudi] nsivabalan closed issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #7049: [SUPPORT] SQLQueryBasedTransformer Not writing transformed parquet data
URL: https://github.com/apache/hudi/issues/7049