Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/27 20:43:19 UTC

[GitHub] [hudi] jardel-lima opened a new issue #3879: [SUPPORT] Incomplete Table Migration

jardel-lima opened a new issue #3879:
URL: https://github.com/apache/hudi/issues/3879


   
   **Describe the problem you faced**
   
   I am trying to migrate some tables to the Hudi format, but I am facing some issues. We have a 7 GB (Snappy-compressed) table with 200M rows, 49 columns, and just one partition. Using the PySpark DataSource, the migration finished without any error, but I noticed that about 10,000 rows were missing from the Hudi table. I tried to migrate the table again, and the same issue happened.
   
   **To Reproduce**
   
   Migrate a huge table with a single partition using the bulk_insert operation.
   
   **Expected behavior**
   
   I expected that all rows would be migrated.
   
   **Environment Description**
   
   * Hudi version : 0.9
   
   * Spark version : 3.0.0
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   - We were able to migrate other tables without any issue;
   - The record key used is unique throughout the table;
   - Different records were missing on each try. For example, record A was migrated on the first try but not on the second;
   - The same number of records was missing in both tries;
   - I cleaned up all data before each try.
   
   Hudi Options:
   ```python
   {
       'hoodie.table.name': 'table_a',
       'hoodie.datasource.write.operation': 'bulk_insert',
       'hoodie.datasource.write.recordkey.field': 'KEY',
       'hoodie.datasource.write.partitionpath.field': 'YEAR',
       'hoodie.datasource.write.precombine.field': 'REF_DATE',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.database': 'database_a',
       'hoodie.datasource.hive_sync.table': 'table_a',
       'hoodie.datasource.hive_sync.partition_fields': 'YEAR',
       'hoodie.datasource.hive_sync.support_timestamp': 'true',
       'hoodie.bulkinsert.shuffle.parallelism': 17,
       'hoodie.cleaner.commits.retained': 3,
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
   }
   ```
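
Since different records were missing on each run, a key-level diff between the source and the migrated table narrows the problem down faster than a plain row count. Below is a minimal sketch of such a check in plain Python over exported record-key sets; the key values are made up for illustration, and only the idea of diffing on the record key field comes from the report above:

```python
# Hypothetical check: diff record keys exported from the source table
# against the keys read back from the migrated Hudi table.
source_keys = {"k001", "k002", "k003", "k004"}  # e.g. collected from the source
hudi_keys = {"k001", "k003", "k004"}            # e.g. collected from the Hudi table

missing = source_keys - hudi_keys  # records lost during migration
extra = hudi_keys - source_keys    # records that should not be there

print(sorted(missing))  # → ['k002']
print(sorted(extra))    # → []
```

With the real tables, the two key sets would be collected from the source DataFrame and from a read of the Hudi base path, then compared the same way.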


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-991847833


   You can share it here if it's a simple script, or via GitHub Gists. Or, if you want to share the data, maybe a Google Drive link or somewhere I have access to download it from.
   





[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1039651290


   Closing this as not reproducible for now, but we would definitely be curious about this. If you can help us with a dataset to reproduce the issue, do let us know; we take missing data very seriously.











[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-997135231


   @jardel-lima : let us know if you have any updates or if you can share the dataset. 





[GitHub] [hudi] jardel-lima edited a comment on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
jardel-lima edited a comment on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-997924540


   Hi @nsivabalan. I am talking with our DPO about modifications we have to make to our dataset before sending it to you. As soon as I get his approval, I will share it with you.
   
   Meanwhile, I can share the code I am using:
   
   ```python
   hudi_options = {
       'hoodie.table.name': 'hudi_table',
       'hoodie.datasource.write.operation': 'bulk_insert',
       'hoodie.datasource.write.recordkey.field': 'UUID',
       'hoodie.datasource.write.partitionpath.field': 'PARTITION',
       'hoodie.datasource.write.precombine.field': 'SORT_KEY',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.database': 'sandbox',
       'hoodie.datasource.hive_sync.table': 'hudi_table',
       'hoodie.datasource.hive_sync.partition_fields': 'PARTITION',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.hive_sync.support_timestamp': 'true',
       'hoodie.parquet.compression.codec': 'snappy',
       'hoodie.bulkinsert.shuffle.parallelism': 100
   }
   
   df.write\
     .format("hudi")\
     .options(**hudi_options)\
     .mode('overwrite')\
     .save('s3://s3-buckete/s3-folder')
   ```
   
   The environment is still the same as I said in the first message.
   
   Thanks for your attention.








[GitHub] [hudi] jardel-lima edited a comment on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
jardel-lima edited a comment on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1007434803


   Hi @nsivabalan. 
   [HERE](https://drive.google.com/file/d/1RsesivvlLUZ9dZh7WbaGJJpnqIcWNbso/view?usp=sharing) is the dataset used to replicate this problem. The file is not public, but I will grant access as soon as you request it.
   
   Here is the code that initiates the Spark session; maybe it will be useful for you:
   ```python
   spark = (
       SparkSession.builder.appName("Hudi_Data_Processing_Framework")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .config("spark.sql.hive.convertMetastoreParquet", "false")
       .config("spark.jars.packages", "org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0")
       .config("spark.executor.memory", "4G")
       .config("spark.executor.cores", "2")
       .enableHiveSupport()
       .getOrCreate()
   )
   ```
   
   Here is the code used to load the dataset:
   ```python
   df = spark.read.load('<<dataset_path>>',
                        encoding='utf-8',
                        format='com.databricks.spark.csv',
                        header=True,
                        delimiter=';',
                        inferSchema=True)
   ```
   Sorry for the delay. I hope it helps you identify the problem. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan closed issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3879:
URL: https://github.com/apache/hudi/issues/3879


   





[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1012636426


   Thanks. I have placed a request. 








[GitHub] [hudi] jardel-lima commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
jardel-lima commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-966607812


   Hi @nsivabalan. Thanks for your reply. I did what you suggested and the issue went away, thanks a lot. Does this setting impact anything in the table? Can I perform all Hudi operations without any problem? 





[GitHub] [hudi] jardel-lima commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
jardel-lima commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1025687351


   Hi @nsivabalan. There is a partition column called `PARTITION`. As it is a partition column, you may have to load the dataset using a relative path, e.g. `./anonymous_dataset`.








[GitHub] [hudi] jardel-lima commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
jardel-lima commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-973158942


   Hi @nsivabalan, as I said before, I could migrate other tables without any problem; only a few tables had this issue. I will try to create a dataset that reproduces the problem. How can I share it with you?








[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1017075893


   May I know what partition path field I should be choosing while writing to Hudi? I assume the record key is UUID and the preCombine field is SORT_KEY. 
   
   ```
   spark.sql("describe tbl1").show()
   +--------+---------+-------+
   |col_name|data_type|comment|
   +--------+---------+-------+
   |    UUID|   string|   null|
   |       A|   string|   null|
   |       B|timestamp|   null|
   |       C|   string|   null|
   |       D|      int|   null|
   |       E|timestamp|   null|
   |       F|   string|   null|
   |       G|timestamp|   null|
   |SORT_KEY|timestamp|   null|
   +--------+---------+-------+
   ```
   








[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-968186419


   @jardel-lima : you may see some perf hit, but you should be fine. 
   Is it possible to give us a reproducible code snippet, or maybe the data (please mask any PII)? We have not seen this issue before, and we definitely want to dig in if there is any data loss. 
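
For masking PII before sharing, one common approach is to replace identifying columns with deterministic salted hashes, which preserves key uniqueness and joinability. A minimal, hypothetical sketch in plain Python (not the masking actually applied to this dataset; column names `KEY`/`YEAR` and the salt are illustrative):

```python
import hashlib

def mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a PII value with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

rows = [{"KEY": "user-1", "YEAR": 2020}, {"KEY": "user-2", "YEAR": 2021}]
masked = [{**row, "KEY": mask(row["KEY"])} for row in rows]

# Deterministic: the same input always yields the same masked key,
# so record-key uniqueness is preserved across the masked dataset.
assert mask("user-1") == mask("user-1")
assert mask("user-1") != mask("user-2")
assert masked[0]["KEY"] != "user-1"
```

In practice the salt should be kept private, since a public salt would let a recipient brute-force low-entropy keys.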
   





[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1002862574


   Sure. Let us know once you have the dataset available to share. 








[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1032016379


   I could not reproduce the missing records. 
   
   ```
   val df = spark.read.
     option("encoding", "utf-8").
     option("header", "true").
     option("inferSchema", "true").
     option("delimiter", ";").
     format("com.databricks.spark.csv").
     load("/Users/nsb/Downloads/anonymous_sample_table/")
   
   df.count
   res0: Long = 10000
   
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   val tableName = "hudi_trips_cow"
   val basePath = "/tmp/hudi_trips_cow5"
   
   // upsert operation
   
   
   df.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "SORT_KEY").
     option(RECORDKEY_FIELD_OPT_KEY, "UUID").
     option(PARTITIONPATH_FIELD_OPT_KEY, "PARTITION").
     option(TABLE_NAME, tableName).
     mode(Append).save(basePath)
   
   
   val hudiDf = spark.read.format("hudi").load(basePath)
   hudiDf.count
   res4: Long = 10000
   
   
   // bulk insert operation
   
   val basePath = "/tmp/hudi_trips_cow6"
   
   df.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "SORT_KEY").
     option("hoodie.datasource.write.operation","bulk_insert").
     option(RECORDKEY_FIELD_OPT_KEY, "UUID").
     option(PARTITIONPATH_FIELD_OPT_KEY, "PARTITION").
     option(TABLE_NAME, tableName).
     mode(Append).save(basePath)
   
   val hudiDf = spark.read.format("hudi").load(basePath)
   hudiDf.count
   ```











[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-965670734


   This seems very strange, that the same number of records is missing every time. Can you try disabling the row writer (`hoodie.datasource.write.row.writer.enable`) to see if the issue goes away? I don't suspect this to be the issue, but just in case.
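
Applied to the PySpark options from the original report, the suggested experiment amounts to one extra entry in the options dict (a sketch; only the row-writer key named above is new, and the other entries are abbreviated):

```python
# Sketch: same options dict as before, with the row writer disabled
# for the bulk_insert path, as suggested above.
hudi_options = {
    'hoodie.table.name': 'table_a',
    'hoodie.datasource.write.operation': 'bulk_insert',
    # ... remaining write/hive_sync options unchanged ...
    'hoodie.datasource.write.row.writer.enable': 'false',
}
```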
   











[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3879:
URL: https://github.com/apache/hudi/issues/3879#issuecomment-1018512485


   @jardel-lima : gentle ping. May I know which field I should be using as the partition path? 




