Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/29 07:53:30 UTC

[GitHub] [hudi] reenarosid opened a new issue #1885: [SUPPORT] MISSING RECORDS

reenarosid opened a new issue #1885:
URL: https://github.com/apache/hudi/issues/1885


   
   Issue: I made a large insert into a Hudi table, but only about a tenth of the records were actually inserted.
   Additionally, my dataset is partitionless.
   I also made sure that de-duplication was set to false (I know it is false by default; I set it explicitly just to be sure).
   Below is the set of commands that I executed.
   
   
   df = spark.read.parquet(PATH + "/*")
   # take 2000 records of the dataset
   df1 = df.limit(2000)
   # take 1000 of those, insert them first, then append the rest (ensuring duplicates)
   set1 = df1.limit(1000)

   The first insert was set1; then I tried inserting df1 (a superset of set1).
   
   hudi_options = {
     'hoodie.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.recordkey.field': 'f1',
     'hoodie.datasource.write.insert.drop.duplicates': 'false',
     'hoodie.datasource.write.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'f1',
     'hoodie.upsert.shuffle.parallelism': 1,
     'hoodie.insert.shuffle.parallelism': 1,
     'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
     'hoodie.datasource.': 'COPY_ON_WRITE',  # COPY_ON_WRITE or MERGE_ON_READ
     'hoodie.cleaner.commits.retained': '1',
     'hoodie.cleaner.fileversions.retained': '1',
     'hoodie.parquet.min.file.size': 6221225472,
   }
   
   
   set1.write.format("org.apache.hudi"). \
     options(**hudi_options). \
     mode("overwrite"). \
     save(HUDI_PATH)
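   Note that 'hoodie.datasource.write.precombine.field' controls which duplicate wins within a batch: among incoming rows that share a record key, Hudi keeps the row with the largest precombine value. A minimal pure-Python sketch of that semantics (hypothetical data; not the Hudi implementation, no Spark involved):

   ```python
   # Sketch of precombine semantics: within one incoming batch, rows that
   # share a record key are collapsed, and the row with the largest
   # precombine-field value survives. Pure Python, hypothetical data.

   def precombine(batch, key="f1", precombine_field="f2"):
       best = {}
       for row in batch:
           k = row[key]
           if k not in best or row[precombine_field] > best[k][precombine_field]:
               best[k] = row
       return list(best.values())

   batch = [
       {"f1": "x", "f2": 1},
       {"f1": "x", "f2": 5},  # same key, larger precombine value -> wins
       {"f1": "y", "f2": 2},
   ]
   print(precombine(batch))  # [{'f1': 'x', 'f2': 5}, {'f1': 'y', 'f2': 2}]
   ```

   With both precombine runs above keyed on f1, any two rows sharing an f1 value collapse into one before they ever reach storage.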
   
   ----------------------- second insertion -----
   hudi_options = {
     'hoodie.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.recordkey.field': 'f1',
     'hoodie.datasource.write.insert.drop.duplicates': 'false',
     'hoodie.datasource.write.table.name': HUDI_TABLE_NAME,
     'hoodie.datasource.write.operation': 'upsert',
     'hoodie.datasource.write.precombine.field': 'f2',
     'hoodie.upsert.shuffle.parallelism': 1,
     'hoodie.insert.shuffle.parallelism': 1,
     'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
     'hoodie.datasource.': 'COPY_ON_WRITE',  # COPY_ON_WRITE or MERGE_ON_READ
     'hoodie.cleaner.commits.retained': '1',
     'hoodie.cleaner.fileversions.retained': '1',
     'hoodie.parquet.min.file.size': 6221225472,
   }
   
   df1.write.format("org.apache.hudi"). \
     options(**hudi_options). \
     mode("append"). \
     save(HUDI_PATH)
   
   
   But when I look at the count, I see that only some of the records were inserted (1043 instead of 3000 in my case).
   Field f1 contains duplicate values in my data source.
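   This outcome is consistent with Hudi de-duplicating on the record key: with 'f1' as the record key, an upsert collapses all rows sharing the same f1 into a single stored record, so the final count equals the number of distinct keys rather than the number of rows written. A minimal pure-Python sketch of that keyed-upsert behavior (hypothetical data; no Spark or Hudi involved):

   ```python
   # Sketch of key-based upsert semantics: every record sharing a record
   # key collapses into one stored row, so the final count equals the
   # number of distinct keys, not the number of rows written.

   def upsert(table, records, key="f1"):
       """Merge records into table keyed on `key`; later rows win."""
       for rec in records:
           table[rec[key]] = rec  # duplicate keys overwrite, never add
       return table

   # Hypothetical data: 3000 rows but only 1043 distinct f1 values.
   rows = [{"f1": i % 1043, "f2": i} for i in range(3000)]

   table = {}
   upsert(table, rows[:1000])  # first insert (set1)
   upsert(table, rows)         # second write (df1, a superset of set1)

   print(len(table))  # 1043 -- matches the observed count
   ```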


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] reenarosid edited a comment on issue #1885: [SUPPORT] MISSING RECORDS

Posted by GitBox <gi...@apache.org>.
reenarosid edited a comment on issue #1885:
URL: https://github.com/apache/hudi/issues/1885#issuecomment-665822501









[GitHub] [hudi] satishkotha edited a comment on issue #1885: [SUPPORT] MISSING RECORDS

Posted by GitBox <gi...@apache.org>.
satishkotha edited a comment on issue #1885:
URL: https://github.com/apache/hudi/issues/1885#issuecomment-665909566


   If all columns in the table are part of the key, updates do not really make sense. An update usually means we want to change the value associated with a key. You could instead consider deleting the previous key and inserting the new key combination. Is having just two columns, both part of the key, a real scenario?
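   The delete-then-insert pattern suggested above can be sketched in plain Python (a hypothetical in-memory store, not the Hudi API):

   ```python
   # Sketch of the suggested delete-then-insert pattern when all columns
   # form the key: instead of updating a value in place, remove the old
   # key tuple and add the new one. Hypothetical in-memory store.

   store = {("a", 1), ("b", 2)}  # each whole row is its own key

   def replace(store, old_row, new_row):
       """'Update' by deleting the old key and inserting the new one."""
       store.discard(old_row)
       store.add(new_row)
       return store

   replace(store, ("a", 1), ("a", 99))
   print(sorted(store))  # [('a', 99), ('b', 2)]
   ```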
   





[GitHub] [hudi] satishkotha commented on issue #1885: [SUPPORT] MISSING RECORDS

Posted by GitBox <gi...@apache.org>.
satishkotha commented on issue #1885:
URL: https://github.com/apache/hudi/issues/1885#issuecomment-665812122









[GitHub] [hudi] reenarosid closed issue #1885: [SUPPORT] MISSING RECORDS

Posted by GitBox <gi...@apache.org>.
reenarosid closed issue #1885:
URL: https://github.com/apache/hudi/issues/1885


   





[GitHub] [hudi] reenarosid commented on issue #1885: [SUPPORT] MISSING RECORDS

Posted by GitBox <gi...@apache.org>.
reenarosid commented on issue #1885:
URL: https://github.com/apache/hudi/issues/1885#issuecomment-665822501





