Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/06/08 03:56:19 UTC

[GitHub] [hudi] fengjian428 opened a new issue #3048: [SUPPORT]delta streamer run bootstrap from one hudi table to another error .parquet is not a Parquet file. expected magic number at tail

fengjian428 opened a new issue #3048:
URL: https://github.com/apache/hudi/issues/3048


   I want to migrate an old table's data to a new one. The old table is in COW mode, the new one is in MOR, and their partitioning is also different. When I ran the command below, the error was:
   java.lang.RuntimeException: hdfs://R2/projects/db__item_v4_tab/bucket_id=1020/e1578b9c-16c7-4b3d-aa64-1f970b961f87-0_10200-48-180878_20210602103323.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-81, -87, 15, 0]
   	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
   	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
   	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
   	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
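   
   (For context: a valid Parquet file ends with the 4-byte magic "PAR1", i.e. bytes [80, 65, 82, 49], so the bytes found instead mean the tail of that file holds something other than a finished Parquet footer, e.g. a write still in flight. A minimal spark-shell sketch to inspect the tail, using the path from the stack trace and assuming HDFS access, is illustrative only:)
   ```
   // Read the last 4 bytes of the reported file and compare them against the
   // Parquet magic "PAR1" (bytes 80, 65, 82, 49).
   import org.apache.hadoop.fs.Path

   val path = new Path("hdfs://R2/projects/db__item_v4_tab/bucket_id=1020/" +
     "e1578b9c-16c7-4b3d-aa64-1f970b961f87-0_10200-48-180878_20210602103323.parquet")
   val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
   val in = fs.open(path)
   try {
     in.seek(fs.getFileStatus(path).getLen - 4)
     val tail = new Array[Byte](4)
     in.readFully(tail)
     println(tail.map(_.toInt).mkString(", ")) // healthy file: 80, 65, 82, 49
   } finally {
     in.close()
   }
   ```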
   
   
   
   command is:
   spark-submit --master yarn --deploy-mode cluster --queue nonlive \
     --conf spark.yarn.maxAppAttempts=1 \
     --driver-memory 20g --driver-cores 2 --executor-memory 15g --executor-cores 2 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0 \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer hudi-utilities-bundle_2.11-0.8.0.jar \
     --table-type MERGE_ON_READ \
     --run-bootstrap \
     --target-base-path /projects/bootdb__item_v4_tab \
     --target-table bootdb__item_v4_tab \
     --hoodie-conf hoodie.bootstrap.base.path=/projects/db__item_v4_tab \
     --hoodie-conf hoodie.datasource.write.recordkey.field=itemid \
     --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
     --source-ordering-field _event.ts \
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/tmp/config/source.avsc \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=/tmp/config/target.avsc \
     --initial-checkpoint-provider org.apache.hudi.utilities.checkpointing.InitialCheckpointFromAnotherHoodieTimelineProvider \
     --checkpoint /projects/db__item_v4_tab/ \
     --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
     --hoodie-conf hoodie.deltastreamer.transformer.sql="Select *,cast(from_unixtime(_event.ts,'YYYY-MM-dd-HH') as string) grass_date from <SRC>" \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=grass_date \
     --hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.ComplexKeyGenerator \
     --hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
     --hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
     --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash closed issue #3048: [SUPPORT]delta streamer run bootstrap from one hudi table to another error .parquet is not a Parquet file. expected magic number at tail

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #3048:
URL: https://github.com/apache/hudi/issues/3048


   





[GitHub] [hudi] n3nash commented on issue #3048: [SUPPORT]delta streamer run bootstrap from one hudi table to another error .parquet is not a Parquet file. expected magic number at tail

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3048:
URL: https://github.com/apache/hudi/issues/3048#issuecomment-859308186


   @fengjian428 Since your source table is a Hudi table, you cannot use `SparkParquetBootstrapDataProvider` to read its data. `SparkParquetBootstrapDataProvider` is meant for reading a plain parquet table. Because new data can still be flowing into your source Hudi table, `SparkParquetBootstrapDataProvider` may end up reading a parquet file that is in the middle of being written, and hence it cannot provide snapshot isolation.
   
   One way to go about this is to implement a `HoodieBootstrapDataProvider` that reads Hudi tables internally through the Spark datasource, something like
   ```
   source = spark.read.format("hudi")...
   ```
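   
   A minimal Scala sketch of that read, assuming Spark 2.4 with Hudi 0.8 on the classpath (the query-type option is spelled out only for clarity, since snapshot is the default; on Hudi 0.8 a partition glob such as `basePath + "/*"` may still be needed when loading):
   ```
   // Snapshot-consistent read of the source Hudi table through the Spark
   // datasource, instead of listing and reading its parquet files directly.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("hudi-bootstrap-source").getOrCreate()

   val source = spark.read
     .format("hudi")
     .option("hoodie.datasource.query.type", "snapshot") // the default, shown for clarity
     .load("/projects/db__item_v4_tab") // source base path from the command above

   // A custom full-record bootstrap provider (SparkParquetBootstrapDataProvider's
   // base class is FullRecordBootstrapDataProvider) would build its HoodieRecords
   // from this DataFrame rather than from spark.read.parquet(filePaths).
   source.printSchema()
   ```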
   
   





[GitHub] [hudi] n3nash commented on issue #3048: [SUPPORT]delta streamer run bootstrap from one hudi table to another error .parquet is not a Parquet file. expected magic number at tail

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #3048:
URL: https://github.com/apache/hudi/issues/3048#issuecomment-866463882


   @fengjian428 Closing this due to inactivity. Please feel free to re-open if you need assistance. 

