Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/06/14 07:31:34 UTC

[GitHub] [hudi] n3nash commented on issue #3048: [SUPPORT]delta streamer run bootstrap from one hudi table to another error .parquet is not a Parquet file. expected magic number at tail

n3nash commented on issue #3048:
URL: https://github.com/apache/hudi/issues/3048#issuecomment-859308186


   @fengjian428 Since your source table is a Hudi table, you cannot use `SparkParquetBootstrapDataProvider` to read it. `SparkParquetBootstrapDataProvider` is meant for reading a plain parquet table. Because new data can keep arriving in your source Hudi table, `SparkParquetBootstrapDataProvider` may end up reading a parquet file that is still being written to the Hudi table, so it does not provide snapshot isolation.
   
   One way to go about this is to implement a `HoodieBootstrapDataProvider` that reads the Hudi table internally through the Spark Datasource, something like:
   ```
   source = spark.read.format("hudi")...
   ```
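
To make the one-liner above a bit more concrete, here is a minimal PySpark-style sketch of the read side only. The helper name `read_hudi_snapshot` and its parameters are hypothetical, and wiring the resulting DataFrame into a custom `HoodieBootstrapDataProvider` is not shown. It assumes the Hudi Spark bundle is on the classpath and uses the `hoodie.datasource.query.type` read option with the value `snapshot`; depending on the Hudi version, the load path may also need partition globs appended.

```python
def read_hudi_snapshot(spark, base_path, extra_options=None):
    """Load a snapshot view of the Hudi table at base_path.

    Hypothetical helper: `spark` is expected to be an existing SparkSession
    and `base_path` the base path of the source Hudi table.
    """
    # Go through the Hudi datasource instead of reading raw parquet files,
    # so only data visible to a snapshot query is returned.
    reader = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
    )
    # Pass through any additional read options supplied by the caller.
    for key, value in (extra_options or {}).items():
        reader = reader.option(key, value)
    return reader.load(base_path)
```

With a live session this would be called as `source = read_hudi_snapshot(spark, base_path)`, where `base_path` is whatever location your source table uses.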
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org