Posted to commits@hudi.apache.org by "menna224 (via GitHub)" <gi...@apache.org> on 2023/02/08 13:15:41 UTC

[GitHub] [hudi] menna224 opened a new issue, #7897: the compaction of the MOR hudi table keeps the old values

menna224 opened a new issue, #7897:
URL: https://github.com/apache/hudi/issues/7897

   I have a Glue job in which I write to a Hudi table, and I write it as MOR. Here's the config:
   
   conf = {
       'className': 'org.apache.hudi',
       'hoodie.table.name': hudi_table_name,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.datasource.write.precombine.field': 'timestamp',
       'hoodie.datasource.write.recordkey.field': 'user_id',
       #'hoodie.datasource.write.partitionpath.field': 'year:SIMPLE,month:SIMPLE,day:SIMPLE',
       #'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
       #'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
       #'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-mm-dd',
       #'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd'
   }
   
   hudiGlueConfig = {
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
       'hoodie.datasource.hive_sync.database': database_name,
       'hoodie.datasource.hive_sync.table': hudi_table_name,
       'hoodie.datasource.hive_sync.use_jdbc': 'false',
       #'hoodie.datasource.write.hive_style_partitioning': 'false',
       #'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       #'hoodie.datasource.hive_sync.partition_fields': 'year,month,day'
   }
   
   config_={**conf, **hudiGlueConfig}
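
   Note that neither dict sets any compaction options, so the table runs with Hudi's defaults. For reference, a hedged sketch of the knobs that control when MOR compaction runs; the values shown are illustrative, not settings from this job:

   compaction_conf = {
       # Ask the writer to compact inline as part of commits (a sketch; how
       # compaction is scheduled by default depends on the writer).
       'hoodie.compact.inline': 'true',
       # Delta commits to accumulate before compacting (documented default: 5).
       'hoodie.compact.inline.max.delta.commits': '5',
   }
   config_with_compaction = {**config_, **compaction_conf}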
   
   I noticed that for each new record I append, a new parquet file is written, and when I update any record, a log file containing the update is created. After a number of appends, the parquet files are compacted into one parquet file. However, this file contains the old values of the initially added records, not the updated ones. Any clue what I might be doing wrong?
   
   The rt view reflects the correct data; the ro view doesn't.
   
   I am writing it as:
   glueContext.forEachBatch(
       frame=data_frame_DataSource0,
       batch_function=processBatch,
       options={
           "windowSize": window_size,
           "checkpointLocation": s3_path_spark,
       },
   )
   
       glueContext.write_dynamic_frame.from_options(
           frame=DynamicFrame.fromDF(df, glueContext, "df"),
           connection_type="custom.spark",
           connection_options=config_
       )
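
   For context, a minimal sketch of how these two snippets fit together; the body of processBatch, the "path" option, and s3_target_path are assumptions on my part, not code from the job:

   from awsglue.dynamicframe import DynamicFrame

   def processBatch(data_frame, batch_id):
       # Glue hands each micro-batch to this function as a Spark DataFrame.
       if data_frame.count() > 0:
           glueContext.write_dynamic_frame.from_options(
               frame=DynamicFrame.fromDF(data_frame, glueContext, "df"),
               connection_type="custom.spark",
               # The connector also needs the table location; "path" and
               # s3_target_path are assumed placeholders.
               connection_options={**config_, "path": s3_target_path},
           )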


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on issue #7897: [SUPPORT]the compaction of the MOR hudi table keeps the old values

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1427512421

   The file removal timing depends on your cleaning strategy; by default it keeps about 30 commits on the timeline. See https://hudi.apache.org/docs/hoodie_cleaner for reference.
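
   For reference, the cleaning-related knobs this refers to; a sketch using Hudi's documented defaults, not settings from this job:

   cleaner_conf = {
       # Clean old file versions automatically after commits.
       'hoodie.clean.automatic': 'true',
       # Keep the file versions needed by the last N commits (default 10);
       # older file slices become candidates for deletion.
       'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
       'hoodie.cleaner.commits.retained': '10',
   }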



[GitHub] [hudi] nsivabalan commented on issue #7897: [SUPPORT]the compaction of the MOR hudi table keeps the old values

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1454333905

   hey @menna224:
   let me clarify something, and then I'll ask for some clarification.
   
   Commit1:
   key1, val1: file1_v1.parquet

   Commit2:
   key2, val2: file1_v2.parquet

   Both file1_v1 and file1_v2 belong to the same file group. When you run a read query, Hudi will only read file1_v2.parquet; this is due to small file handling. The cleaner, when it gets executed later, will clean up file1_v1.parquet, but once file1_v2.parquet is created, none of your snapshot queries will read from file1_v1.

   Commit3:
   key3, val3: again due to small file handling, file1_v3.parquet.

   Commit4:
   key3, val4 (same key as before, but an update).
   Hudi will add a log file to file1 (the file group).
   
   So, on disk
   there are file1_v3.parquet and a log file for file1.

   With rt, Hudi will read both of them, merge, and serve.
   In case of ro, Hudi will read just file1_v3.parquet.
   
   Let's say we keep adding more updates for key3; more log files will be added.
   Once compaction kicks in, a new parquet file will be created:
   file1_v4.parquet (which is a merged version of file1_v3 + all associated log files).
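
   To make the rt vs ro distinction concrete, a minimal PySpark sketch; the table path and session setup are illustrative assumptions:

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   base_path = "s3://my-bucket/hudi/users"  # placeholder path

   # Snapshot query (the rt view): merges base parquet files with log files.
   spark.read.format("hudi") \
       .option("hoodie.datasource.query.type", "snapshot") \
       .load(base_path).show()

   # Read-optimized query (the ro view): reads base parquet files only, so
   # updates sitting in log files stay invisible until compaction produces
   # the new base file (file1_v4.parquet above).
   spark.read.format("hudi") \
       .option("hoodie.datasource.query.type", "read_optimized") \
       .load(base_path).show()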
   
   Can you clarify what the issue is that you are seeing? Your example wasn't very clear to me,
   especially these statements:
   ```
   then after the 10th update where I changed the name to "joe", I can see 10 log files and only 1 parquet file; the parquet file that is kept is the last one (file3.parquet) with the old values, not the updated ones:
   (id=3,name=mg)
   (id=4,name=sa)
   (id=5,name=john)

   and file1.parquet & file2.parquet were deleted.
   rt table contained the right values (the three records, and the last record has the value "joe" for the column name)
   ro contained the values that are in the parquet
   ```


[GitHub] [hudi] danny0405 commented on issue #7897: [SUPPORT]the compaction of the MOR hudi table keeps the old values

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1426629317

   > I noticed that for each new record I append I get a parquet file: the first parquet has the first record, then when I insert a new row a second parquet file is created with both records, and when I insert for the third time a third parquet file is created with the 3 rows. When I update any of them I have a log file containing the update, and after a number of appends the parquet files are compacted into one parquet file (the newest parquet file, which has the three appended records, is kept; the other two parquet files are removed).
   
   This is actually how the `BLOOM_FILTER` index works: all the inserts are written into a new FileSlice, and only delta updates are written into logs (because, for UPDATEs, Hudi needs to know where its old records are located). There is also a small file/FileSlice strategy at play, so things are a bit more complex; as you have perceived, new records are written into the same file group.
   
   The rt view merges all the base parquet files and delta logs, so its result is correct.
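
   For reference, the knobs behind this behavior; a sketch using Hudi's documented defaults, not settings from this job:

   index_conf = {
       # BLOOM is the default index type for Spark writers; it locates the
       # file groups holding existing records so updates become log entries.
       'hoodie.index.type': 'BLOOM',
       # Inserts are routed into file groups whose base file is smaller than
       # this limit (default 100 MB): the "small file handling" above.
       'hoodie.parquet.small.file.limit': '104857600',
   }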




[GitHub] [hudi] menna224 commented on issue #7897: the compaction of the MOR hudi table keeps the old values

Posted by "menna224 (via GitHub)" <gi...@apache.org>.
menna224 commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1424063815

   Update: this issue occurred with Glue version 3 and the Hudi connector from the AWS Marketplace,
   but the same thing happened when I tried Glue 4 and used a jar for Hudi version [hudi-spark3.3-bundle_2.12-0.12.2.jar].




[GitHub] [hudi] menna224 commented on issue #7897: the compaction of the MOR hudi table keeps the old values

Posted by "menna224 (via GitHub)" <gi...@apache.org>.
menna224 commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1424274329

   @nsivabalan can you please help with this? :)




[GitHub] [hudi] menna224 commented on issue #7897: [SUPPORT]the compaction of the MOR hudi table keeps the old values

Posted by "menna224 (via GitHub)" <gi...@apache.org>.
menna224 commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1426911134

   Thank you @danny0405 for your response. I am aware that rt would reflect it, and indeed it reflected the changes instantly. But as far as I understand, once compaction is done, ro will pick up the changes; when will that happen, and how? I was expecting that after the 10 commits, when the old parquet files were deleted and only one was left, the changes should have been reflected?
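
   One way to see whether compaction has actually completed is to list the timeline under .hoodie/; on a MOR table, upserts complete as .deltacommit instants while compactions complete as .commit instants. A hypothetical sketch, with bucket and prefix as placeholders:

   import boto3

   s3 = boto3.client("s3")
   # Point these at the Hudi table's base path.
   resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="path/to/table/.hoodie/")
   for obj in resp.get("Contents", []):
       name = obj["Key"].rsplit("/", 1)[-1]
       # A .commit instant here marks a completed compaction on a MOR table.
       if name.endswith((".commit", ".deltacommit")):
           print(name)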

