Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/31 15:27:30 UTC

[GitHub] [hudi] nsivabalan opened a new issue #5189: [SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta columns

nsivabalan opened a new issue #5189:
URL: https://github.com/apache/hudi/issues/5189


   
   **Describe the problem you faced**
   
   From the user:
   I am trying to read from one Hudi table and write to another Hudi table using DeltaStreamer, and I am getting the error below.
   
   
   Steps to reproduce:
   ```
   create the first hudi table using ConfluentAvroKafkaSource ->
   create the second table with HoodieIncrSource, consuming the output of the first table ->
   create the third table with HoodieIncrSource, consuming the output of the second table
   (the error occurs on incremental runs of deltastreamer for the third table)
   ```
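
   To make the chaining concrete, here is a minimal Spark shell sketch (paths and schema are hypothetical placeholders, not the user's setup) of why the meta columns end up in the downstream data schema: a Hudi read returns `_hoodie_partition_path` and the other metadata columns alongside the data columns, so unless they are dropped they flow into the next table as ordinary data columns.

   ```
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("chain-illustration").getOrCreate()

   // Reading the first Hudi table returns the Hudi metadata columns
   // (_hoodie_commit_time, _hoodie_partition_path, ...) as regular columns.
   val firstTable = spark.read.format("hudi").load("s3://bucket/first_table") // placeholder path
   firstTable.printSchema()

   // If these fields are carried into the second table's data schema, a later
   // read of the second table can see _hoodie_partition_path both as a data
   // column and as the meta column Hudi adds, which trips Spark's
   // duplicate-column check (see the AnalysisException below).
   ```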
   
   stacktrace: 
   
   ```
   client token: N/A
   	 diagnostics: User class threw exception: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_partition_path`;
   	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:90)
   	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:70)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:440)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
   	at com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
   	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
   	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:76)
   	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:95)
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:388)
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:283)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
   	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:728)
   ```
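
   As a possible workaround (a sketch of the idea, not the stock or custom HoodieIncrSource code), the Hudi metadata columns can be dropped from the incrementally read rows before they reach the downstream writer. `HoodieRecord.HOODIE_META_COLUMNS` in hudi-common holds the metadata column names; the path, instant time, and table below are placeholders.

   ```
   import scala.collection.JavaConverters._
   import org.apache.hudi.common.model.HoodieRecord
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("incr-read-sketch").getOrCreate()

   // Incremental read of the upstream (second) table -- placeholder path and instant.
   val incr = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "incremental")
     .option("hoodie.datasource.read.begin.instanttime", "20220331000000")
     .load("s3://bucket/second_table")

   // Drop _hoodie_commit_time, _hoodie_partition_path, etc. so they are not
   // persisted as data columns in the third table.
   val cleaned = incr.drop(HoodieRecord.HOODIE_META_COLUMNS.asScala: _*)
   ```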
   
   
   write configs: 
   ```
   spark-submit --master yarn \
     --jars /usr/lib/spark/external/lib/spark-avro.jar,s3://***/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --conf spark.executor.cores=3 --conf spark.driver.memory=4g \
     --conf spark.driver.memoryOverhead=800m --conf spark.executor.memoryOverhead=1800m \
     --conf spark.executor.memory=16g \
     --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.initialExecutors=1 \
     --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=6 \
     --conf spark.scheduler.mode=FAIR --conf spark.task.maxFailures=5 \
     --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.shuffle.service.enabled=true --conf spark.sql.hive.convertMetastoreParquet=false \
     --conf spark.yarn.max.executor.failures=5 \
     --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true \
     --conf spark.sql.catalogImplementation=hive \
     --deploy-mode cluster \
     s3://*****/jars/deltastreamer-addons-1.1-SNAPSHOT.jar \
     --hoodie-conf hoodie.parquet.compression.codec=snappy \
     --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=10 \
     --hoodie-conf hoodie.datasource.write.partitionpath.field= \
     --table-type COPY_ON_WRITE \
     --source-class com.navi.sources.HoodieIncrSource \
     --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://*****/input_path \
     --hoodie-conf hoodie.metrics.on=true \
     --hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY \
     --hoodie-conf hoodie.metrics.pushgateway.host=pushgateway.prod.navi-tech.in \
     --hoodie-conf hoodie.metrics.pushgateway.port=443 \
     --hoodie-conf hoodie.metrics.pushgateway.delete.on.shutdown=false \
     --hoodie-conf hoodie.metrics.pushgateway.job.name=*** \
     --hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false \
     --hoodie-conf hoodie.metrics.reporter.metricsname.prefix=hudi \
     --target-base-path s3://*****/output_path \
     --target-table some_table \
     --enable-sync \
     --hoodie-conf hoodie.datasource.hive_sync.database=db \
     --hoodie-conf hoodie.datasource.hive_sync.table=out_tbl \
     --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
     --hoodie-conf hoodie.clustering.inline=true \
     --hoodie-conf hoodie.clustering.inline.max.commits=2 \
     --hoodie-conf hoodie.datasource.write.recordkey.field=contact_number_cleaned \
     --hoodie-conf hoodie.datasource.write.precombine.field=id \
     --hoodie-conf hoodie.datasource.clustering.inline.enable=true \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
     --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=134217728 \
     --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=273741824 \
     --source-ordering-field id \
     --hoodie-conf transformer.normalize.json.column=id \
     --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id,col1,col2 from <SRC>)" \
     --transformer-class com.custom.transform.ArrayJsonToStructTypeTransformer,org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
   
   ```
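
   Since the job already chains transformers, another option would be to prepend a transformer that strips the Hudi meta columns before the SQL transformer runs. The class below is hypothetical (not shipped with Hudi) and assumes the `org.apache.hudi.utilities.transform.Transformer` interface from hudi-utilities:

   ```
   import scala.collection.JavaConverters._
   import org.apache.hudi.common.config.TypedProperties
   import org.apache.hudi.common.model.HoodieRecord
   import org.apache.hudi.utilities.transform.Transformer
   import org.apache.spark.api.java.JavaSparkContext
   import org.apache.spark.sql.{Dataset, Row, SparkSession}

   // Hypothetical transformer: drops the _hoodie_* meta columns from the source rows.
   class DropHoodieMetaColumnsTransformer extends Transformer {
     override def apply(jsc: JavaSparkContext,
                        sparkSession: SparkSession,
                        rowDataset: Dataset[Row],
                        properties: TypedProperties): Dataset[Row] = {
       // drop() ignores columns that are not present, so this is safe even when
       // the source already excludes the meta fields.
       rowDataset.drop(HoodieRecord.HOODIE_META_COLUMNS.asScala: _*)
     }
   }
   ```

   If something like this were used, it would be listed first in `--transformer-class`, ahead of the existing custom and SQL transformers.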
   
   
   **Expected behavior**
   
   The third DeltaStreamer (HoodieIncrSource reading the second table) should keep running incrementally and writing to the third table without failing; Hudi metadata columns such as `_hoodie_partition_path` should not end up duplicated in the data schema when tables are chained this way.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   
   **Stacktrace**
   
   See the stacktrace in the problem description above.
   
   





[GitHub] [hudi] t0il3ts0ap commented on issue #5189: [SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta column

Posted by GitBox <gi...@apache.org>.
t0il3ts0ap commented on issue #5189:
URL: https://github.com/apache/hudi/issues/5189#issuecomment-1085412449


   ```
   Environment Description
   
       Hudi version : 0.10.0
   
       Spark version : 3.0.1
   
       Hive version : Hive 3.1.2
   
       Hadoop version :
   
       Storage (HDFS/S3/GCS..) : S3
   
       Running on Docker? (yes/no) : no
   ```

