Posted to commits@hudi.apache.org by "Brandon Scheller (Jira)" <ji...@apache.org> on 2020/08/05 16:23:00 UTC

[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

    [ https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171596#comment-17171596 ] 

Brandon Scheller commented on HUDI-1146:
----------------------------------------

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn --deploy-mode client \
--table-type COPY_ON_WRITE \
--continuous \
--enable-hive-sync \
--min-sync-interval-seconds 60 \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--target-base-path s3://pathtotable/table/ \
--target-table hudi_table \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--hoodie-conf hoodie.datasource.write.recordkey.field="XXXX" \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.partitionpath.field="XXX" \
--hoodie-conf hoodie.datasource.hive_sync.database=xxxxx \
--hoodie-conf hoodie.datasource.hive_sync.table=xxxxxx \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields="XXXXX" \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://pathtoinput/xxx/
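
For reference, FilebasedSchemaProvider also expects the schema file location to be configured (the target schema file should default to the source one if unset); roughly along these lines, with placeholder paths rather than the ones from my actual job:

--hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://path/to/source.avsc \
--hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://path/to/target.avsc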

> DeltaStreamer fails to start when No updated records + schemaProvider not supplied
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1146
>                 URL: https://issues.apache.org/jira/browse/HUDI-1146
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Hive Integration
>            Reporter: Brandon Scheller
>            Priority: Major
>
> DeltaStreamer issue (happens with both COW and MOR): restarting the DeltaStreamer process crashes, i.e. the 2nd run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated by the step above
>  The 2nd run crashes with the error below (it does not crash if we delete the output parquet files)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!}}
> {{ at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> This looks to be because of this line:
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
> The "orElse" block here doesn't seem to make sense: if "transformed" is empty, then "dataAndCheckpoint" will likely have a null schema provider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)